
2025-05-15-12-03

Improving the Reliability of LLMs: Combining CoT, RAG, Self-Consistency, and Self-Verification

Abstract

arXiv:2505.09031v1 Announce Type: new Abstract: Hallucination, where large language models (LLMs) generate confident but incorrect or irrelevant information, remains a key limitation in their application to complex, open-ended tasks. Chain-of-thought (CoT) prompting has emerged as a promising method for improving multistep reasoning by guiding models through intermediate steps. However, CoT alone does not fully address the hallucination problem. In this work, we investigate how combining CoT with retrieval-augmented generation (RAG), as well as applying self-consistency and self-verification strategies, can reduce hallucinations and improve factual accuracy. By incorporating external knowledge sources during reasoning and enabling models to verify or revise their own outputs, we aim to generate more accurate and coherent responses. We present a comparative evaluation of baseline LLMs against CoT, CoT+RAG, self-consistency, and self-verification techniques. Our results highlight the effectiveness of each method and identify the most robust approach for minimizing hallucinations while preserving fluency and reasoning depth.

摘要

幻觉现象(即大型语言模型生成自信但错误或无关信息)仍是其在复杂开放任务应用中的主要局限。思维链提示通过引导模型进行中间步骤推理,已成为改进多步推理的有效方法,但单独使用仍无法完全解决幻觉问题。本研究探讨了将思维链与检索增强生成相结合,并应用自一致性和自验证策略如何减少幻觉并提高事实准确性。通过在推理过程中引入外部知识源,并使模型能够验证或修正自身输出,我们致力于生成更准确、连贯的响应。我们对基线大型语言模型与思维链、思维链+检索增强生成、自一致性及自验证技术进行了对比评估。实验结果揭示了各方法的有效性,并确定了在保持流畅性和推理深度的同时最小化幻觉的最优方案。
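
As a rough illustration of the self-consistency strategy named in this abstract, the sketch below samples several answers to the same question and keeps the most frequent one. The abstract does not describe the authors' implementation, so everything here is an assumption: `sample_answer` stands in for a call to an LLM that returns a final answer after chain-of-thought reasoning, and `make_fake_sampler` is a hypothetical helper that replays canned outputs so the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistent_answer(
    sample_answer: Callable[[str], str], question: str, n_samples: int = 5
) -> str:
    """Sample several reasoning paths and return the most frequent final answer."""
    # Normalise lightly so that "Paris" and "paris " count as the same vote.
    answers: List[str] = [
        sample_answer(question).strip().lower() for _ in range(n_samples)
    ]
    # most_common(1) returns the answer with the highest vote count.
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays canned outputs in order."""
    it = iter(outputs)
    return lambda question: next(it)


sampler = make_fake_sampler(["Paris", "paris ", "Lyon", "Paris", "Marseille"])
result = self_consistent_answer(sampler, "What is the capital of France?")
print(result)
```

In practice the sampler would call a model with a non-zero temperature so that the reasoning paths differ; majority voting only helps when the samples are not identical.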


ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor

Abstract

arXiv:2505.09142v1 Announce Type: new Abstract: We propose ELIS, a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler designed to efficiently manage inference tasks with the shortest remaining tokens. Current LLM serving systems often employ a first-come-first-served scheduling strategy, which can lead to the "head-of-line blocking" problem. To overcome this limitation, it is necessary to predict LLM inference times and apply a shortest job first scheduling strategy. However, due to the auto-regressive nature of LLMs, predicting the inference latency is challenging. ELIS addresses this challenge by training a response length predictor for LLMs using the BGE model, an encoder-based state-of-the-art model. Additionally, we have devised the ISRTF scheduling strategy, an optimization of shortest remaining time first tailored to existing LLM iteration batching. To evaluate our work in an industrial setting, we simulate streams of requests based on our study of real-world user LLM serving trace records. Furthermore, we implemented ELIS as a cloud-native scheduler system on Kubernetes to evaluate its performance in production environments. Our experimental results demonstrate that ISRTF reduces the average job completion time by up to 19.6%.

摘要

我们提出ELIS——一个配备迭代式最短剩余时间优先(ISRTF)调度器的大语言模型服务系统,专为高效管理剩余token数最少的推理任务而设计。当前大语言模型服务系统通常采用先到先服务的调度策略,容易导致"队头阻塞"问题。为突破这一局限,需预测大语言模型推理时长并采用最短任务优先调度策略。然而由于大语言模型的自回归特性,推理延迟预测具有挑战性。ELIS通过采用基于编码器的前沿模型BGE来训练大语言模型响应长度预测器,成功解决了这一难题。此外,我们设计了ISRTF调度策略——这是针对现有大语言模型迭代批处理特性优化的最短剩余时间优先算法。为在工业场景中评估方案,我们基于真实用户服务追踪记录研究模拟了请求流,并将ELIS实现为Kubernetes上的云原生调度系统以评估其生产环境性能。实验结果表明,ISRTF可使任务平均完成时间最高降低19.6%。
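
To make the contrast between first-come-first-served and shortest-remaining-time-first scheduling concrete, here is a minimal simulation. It is a sketch under strong simplifying assumptions, not the ELIS scheduler: one worker serves one job at a time, all jobs arrive at time 0, a scheduling iteration covers `window` tokens, and the true remaining length is used as an oracle where ELIS would use its learned response length predictor. The function name `average_jct` and the job lengths are made up for illustration.

```python
from typing import Dict, List


def average_jct(lengths: List[int], policy: str, window: int = 50) -> float:
    """Mean job completion time (in token-steps) on a single worker.

    Each scheduling iteration runs the chosen job for up to `window` tokens,
    then the scheduler picks again.
    """
    remaining: Dict[int, int] = dict(enumerate(lengths))
    finished: Dict[int, int] = {}
    clock = 0
    while remaining:
        if policy == "fcfs":
            job = min(remaining)  # lowest index = earliest arrival
        elif policy == "isrtf":
            # Oracle assumption: remaining[j] plays the role of the
            # predicted remaining tokens, re-evaluated every iteration.
            job = min(remaining, key=lambda j: (remaining[j], j))
        else:
            raise ValueError(f"unknown policy: {policy}")
        step = min(window, remaining[job])
        clock += step
        remaining[job] -= step
        if remaining[job] == 0:
            finished[job] = clock
            del remaining[job]
    return sum(finished.values()) / len(finished)


lengths = [400, 50, 100]  # a long job arrives first and blocks the queue
fcfs = average_jct(lengths, "fcfs")
isrtf = average_jct(lengths, "isrtf")
print(f"FCFS mean JCT: {fcfs:.1f}, ISRTF-style mean JCT: {isrtf:.1f}")
```

With the long job at the head of the queue, FCFS makes the two short jobs wait behind it, which is the head-of-line blocking effect the abstract describes. A real serving system batches many requests per iteration and works with imperfect predictions, so gains would be smaller than in this toy case.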


Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora

Abstract

arXiv:2505.08905v1 Announce Type: new Abstract: Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users might ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is limited and being rapidly outpaced by the size and scope of the models under evaluation. Additionally, having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages those very same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human curated questions with a Spearman ranking correlation of 0.96 and a benchmark evaluation Pearson accuracy correlation of 0.79. This novel tool supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight of LM capability. We apply this methodology to evaluate model performance on a recent relevant arXiv preprint, discovering a surprisingly strong performance from Gemma3 models.

摘要

语言模型(LMs)持续发展,其响应质量和连贯性不断提升。鉴于其训练数据达到互联网规模,这些模型很可能已在训练过程中以某种形式接触过用户可能要求生成的大部分内容。目前学界已构建大量评估基准来测试模型质量、响应适当性和推理能力。然而,人工构建评估基准所需的精力有限,正迅速落后于被评估模型的规模和范围。此外,由人类为每个潜在关注领域构建评估基准也不现实。为此,我们提出一种基于文档群体的自动化构建事实型合成数据模型评估方法。本研究利用语言模型自身,仅以基础文档(如教科书)作为输入,自动评估领域特定知识。该合成数据基准测试方法与人工编制问题具有高度一致性:斯皮尔曼等级相关系数达0.96,基准评估皮尔逊准确度相关系数为0.79。这一创新工具支持生成多项选择和开放式合成数据问题,用以诊断语言模型能力。我们将该方法应用于评估模型在最新arXiv预印本上的表现,意外发现Gemma3系列模型展现出卓越性能。
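
The two agreement statistics quoted in this abstract measure different things: Spearman correlation compares how two benchmarks rank the models, while Pearson correlation compares the accuracy values themselves. The sketch below computes both in plain Python. The accuracy numbers are invented for four hypothetical models and are not taken from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up accuracies of four hypothetical models on two benchmarks.
human_acc = [0.62, 0.71, 0.80, 0.90]
synthetic_acc = [0.55, 0.70, 0.68, 0.93]
rho = spearman(human_acc, synthetic_acc)
r = pearson(human_acc, synthetic_acc)
print(f"Spearman: {rho:.2f}, Pearson: {r:.2f}")
```

In this toy data the synthetic benchmark swaps the order of the second and third models, which lowers the rank correlation even though the accuracy values stay close.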


Automated Meta Prompt Engineering for Alignment with the Theory of Mind

Abstract

arXiv:2505.09024v1 Announce Type: new Abstract: We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human's mental expectation and a Large Language Model's (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, the expectations of human content reviewers had 100% of alignment with AI 53.8% of the time with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space combines spatial volume (all trait importance) with vertices alignment (individual trait relevance) enabled the LLMaaJ to optimize on Human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work that was deployed at the US Open 2024 has been used across other live events within sports and entertainment.

摘要

我们提出了一种元提示方法,该方法在生成复杂任务流畅文本的同时,优化人类心理预期与大型语言模型(LLM)神经处理状态之间的相似性。通过应用代理强化学习技术,由作为评判者的LLM(LLMaaJ)通过上下文学习教导另一个LLM,使其能够通过解析生成文本中的预期与非预期特征来创作内容。为测量人类对内容创作的心理认知,用户在美国网球公开赛2024大满贯赛事发表前对AI生成的长篇文本进行了修改。如今,LLMaaJ能够通过预判并整合人类编辑行为来解决心理理论(ToM)对齐问题,在LLM文本生成过程中实现人机协同。实验结果表明,在实时生产系统中,人类内容审核者的期望与AI生成内容实现了53.8%的完全对齐(平均迭代次数4.38次)。通过将事实性、新颖性、重复性和相关性等内容特征在希尔伯特向量空间进行几何表征——结合空间体积(所有特征重要性)与顶点对齐(个体特征相关性)——LLMaaJ实现了对人类心理理论的优化。该方法通过扩展网球赛事报道的覆盖范围显著提升了内容质量。我们应用于美国公开赛2024的技术成果,已被推广至体育和娱乐领域的其他实时赛事中。


Access Controls Will Solve the Dual-Use Dilemma

Abstract

arXiv:2505.09341v1 Announce Type: new Abstract: AI safety systems face a dual-use dilemma. Since the same request can be either harmless or harmful depending on who made it and why, if the system makes decisions based solely on the request's content, it will refuse some legitimate queries and let pass harmful ones. To address this, we propose a conceptual access control framework, based on verified user credentials (such as institutional affiliation) and classifiers that assign model outputs to risk categories (such as advanced virology). The system permits responses only when the user's verified credentials match the category's requirements. For implementation of the model output classifiers, we introduce a theoretical approach utilizing small, gated expert modules integrated into the generator model, trained with gradient routing, that enable efficient risk detection without the capability gap problems of external monitors. While open questions remain about the verification mechanisms, risk categories, and the technical implementation, our framework makes the first step toward enabling granular governance of AI capabilities: verified users gain access to specialized knowledge without arbitrary restrictions, while adversaries are blocked from it. This contextual approach reconciles model utility with robust safety, addressing the dual-use dilemma.

摘要

人工智能安全系统面临双重用途困境。由于同一请求可能因发起者身份及目的不同而具有无害性或危害性,若系统仅基于请求内容进行决策,将导致部分合法查询被拒绝而有害请求被放行。为此,我们提出一个概念性访问控制框架,其基础是经过验证的用户凭证(如机构隶属关系)以及将模型输出划分至风险类别(如高级病毒学)的分类器。该系统仅在用户验证凭证符合类别要求时允许响应。针对模型输出分类器的实现,我们提出一种理论方法:通过在生成模型中集成小型门控专家模块,配合梯度路由训练,实现高效风险检测,避免外部监测器存在的能力差距问题。尽管在验证机制、风险类别和技术实现方面仍存在开放性问题,本框架朝着实现AI能力的细粒度治理迈出了第一步:验证用户可不受任意限制获取专业知识,而攻击者则被阻挡在外。这种情境化方法兼顾模型效用与稳健的安全性,从而应对双重用途困境。


Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Abstract

arXiv:2505.09343v1 Announce Type: new Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.

摘要

大型语言模型(LLM)的快速扩展揭示了当前硬件架构的关键局限性,包括内存容量、计算效率和互连带宽等方面的约束。基于2,048块NVIDIA H800 GPU训练的DeepSeek-V3展示了硬件感知的模型协同设计如何有效应对这些挑战,实现高性价比的大规模训练与推理。本文深入分析了DeepSeek-V3/R1模型架构及其AI基础设施,重点阐述了多项关键创新:提升内存效率的多头潜在注意力机制(MLA)、优化计算-通信权衡的混合专家系统(MoE)架构、充分释放硬件潜力的FP8混合精度训练,以及最小化集群级网络开销的多平面网络拓扑结构。基于DeepSeek-V3开发过程中遇到的硬件瓶颈,我们与学界和业界同仁就未来硬件发展方向展开更广泛讨论,包括精密低精度计算单元、纵向扩展与横向扩展的融合、低延迟通信架构创新等。这些见解强调了硬件与模型协同设计在满足AI工作负载日益增长需求中的关键作用,为下一代AI系统创新提供了实用蓝图。


Reproducibility Study of "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents"

Abstract

arXiv:2505.09289v1 Announce Type: new Abstract: This study evaluates and extends the findings made by Piatti et al., who introduced GovSim, a simulation framework designed to assess the cooperative decision-making capabilities of large language models (LLMs) in resource-sharing scenarios. By replicating key experiments, we validate claims regarding the performance of large models, such as GPT-4-turbo, compared to smaller models. The impact of the universalization principle is also examined, with results showing that large models can achieve sustainable cooperation, with or without the principle, while smaller models fail without it. In addition, we provide multiple extensions to explore the applicability of the framework to new settings. We evaluate additional models, such as DeepSeek-V3 and GPT-4o-mini, to test whether cooperative behavior generalizes across different architectures and model sizes. Furthermore, we introduce new settings: we create a heterogeneous multi-agent environment, study a scenario using Japanese instructions, and explore an "inverse environment" where agents must cooperate to mitigate harmful resource distributions. Our results confirm that the benchmark can be applied to new models, scenarios, and languages, offering valuable insights into the adaptability of LLMs in complex cooperative tasks. Moreover, the experiment involving heterogeneous multi-agent systems demonstrates that high-performing models can influence lower-performing ones to adopt similar behaviors. This finding has significant implications for other agent-based applications, potentially enabling more efficient use of computational resources and contributing to the development of more effective cooperative AI systems.

摘要

本研究对Piatti等人提出的GovSim仿真框架进行了评估与拓展,该框架旨在评估大语言模型(LLMs)在资源共享场景中的协同决策能力。通过复现关键实验,我们验证了关于GPT-4-turbo等大模型相较于小模型性能的论断。研究同时检验了普遍化原则的影响,结果表明大模型无论是否遵循该原则都能实现可持续合作,而小模型脱离该原则则无法达成合作。此外,我们通过多项扩展研究探索了该框架在新场景中的适用性:评估了DeepSeek-V3和GPT-4o-mini等模型以检验协同行为在不同架构和模型规模间的泛化性;创设了异构多智能体环境;采用日语指令开展情境研究;构建了需通过协作缓解有害资源分布的"逆向环境"。实验证实该基准可适用于新模型、新场景及新语言,为LLMs在复杂协同任务中的适应性提供了重要见解。值得注意的是,异构多智能体系统的实验表明高性能模型能引导低性能模型采取相似行为,这一发现对其他基于智能体的应用具有重要启示,既可提升计算资源利用效率,也有助于开发更具效力的协同AI系统。


Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony

Abstract

arXiv:2505.08944v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as μ-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth. We implement these ideas in a serving system called AMoE, which disaggregates attention from expert layers and uses a defragging scheduler to reduce batch fragmentation. Evaluations on prototype MoE models show that AMoE improves throughput by up to 2.7x compared to state-of-the-art baselines, incurring a manageable latency penalty and providing a cost-effective operating point. Furthermore, experiments demonstrate nearly linear scalability to multi-node settings, whereas the baseline system shows no throughput increase even when the number of GPUs is doubled.

摘要

混合专家(MoE)架构提供了在不采用完全密集设计的高昂成本前提下扩展模型容量的可能性。然而,在实际推理服务中,专家间的负载倾斜常导致设备利用率低下和同步开销过大。本文提出异步专家并行(AEP)新范式,通过解耦层执行与屏障式同步机制,采用动态令牌分层队列(称为μ-排队)和按需自适应重批处理技术,使GPU无需等待滞后专家,转而持续处理就绪层。该异步方法有效缓解了传统专家并行系统的两大低效问题:(1)等待最热专家时的GPU闲置时间;(2)冷门专家小批量执行导致的内存带宽浪费。

我们在AMoE服务系统中实现了这些创新,该系统将注意力机制与专家层解耦,并采用碎片整理调度器降低批处理碎片化。原型MoE模型评估表明,相较于最先进基线系统,AMoE最高可实现2.7倍吞吐量提升,仅带来可控的延迟代价,提供了高性价比的运行方案。进一步实验证实系统在多节点环境中呈现近线性扩展能力,而基线系统在GPU数量翻倍时吞吐量仍无增长。


The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners

Abstract

arXiv:2505.09396v1 Announce Type: new Abstract: The rapid rise of large language models (LLMs) has shifted artificial intelligence (AI) research toward agentic systems, motivating the use of weaker and more flexible notions of agency. However, this shift raises key questions about the extent to which LLM-based agents replicate human strategic reasoning, particularly in game-theoretic settings. In this context, we examine the role of agentic sophistication in shaping artificial reasoners' performance by evaluating three agent designs: a simple game-theoretic model, an unstructured LLM-as-agent model, and an LLM integrated into a traditional agentic framework. Using guessing games as a testbed, we benchmarked these agents against human participants across general reasoning patterns and individual role-based objectives. Furthermore, we introduced obfuscated game scenarios to assess agents' ability to generalise beyond training distributions. Our analysis, covering over 2000 reasoning samples across 25 agent configurations, shows that human-inspired cognitive structures can enhance LLM agents' alignment with human strategic behaviour. Still, the relationship between agentic design complexity and human-likeness is non-linear, highlighting a critical dependence on underlying LLM capabilities and suggesting limits to simple architectural augmentation.

摘要

大型语言模型(LLMs)的迅速崛起将人工智能(AI)研究导向了代理系统领域,促使研究者采用更弱化且更灵活的代理概念。然而,这一转变引发了一个关键问题:基于LLM的代理在多大程度上能复现人类的策略推理能力,尤其是在博弈论情境中。为此,我们通过评估三种代理设计——基础博弈论模型、非结构化的LLM-as-agent模型,以及整合到传统代理框架中的LLM模型——来探究代理复杂性对人工推理者表现的影响。以竞猜游戏为测试平台,我们将这些代理与人类参与者在通用推理模式和个体角色目标方面进行对比测试。此外,我们引入模糊化游戏场景以评估代理在训练分布之外的泛化能力。通过对25种代理配置下2000余份推理样本的分析表明,受人类启发的认知结构能提升LLM代理与人类策略行为的契合度。但代理设计复杂度与拟人化程度之间呈非线性关系,这凸显了其对底层LLM能力的核心依赖性,也暗示了简单架构增强的局限性。


Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Abstract

arXiv:2505.09614v1 Announce Type: new Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established "Blicket Test" paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

摘要

语言模型(LM)作为自主决策者的应用日益广泛,其需要主动收集信息以指导决策。对此类智能体而言,关键的认知能力在于高效探索和理解世界的因果结构——这是实现稳健、科学严谨推理的核心。然而,目前尚不清楚语言模型是否具备这种能力,或是否存在导致错误结论的系统性偏差。本研究采用发展心理学中成熟的"Blicket测试"范式,检验语言模型探索和推断因果关系的能力。研究发现:语言模型能可靠推断常见、直观的析取因果关系,但对非常规却证据充分(有时甚至更充分)的合取关系存在系统性困难。这种"析取偏差"在不同模型系列、规模和提示策略中持续存在,且任务复杂度增加时表现进一步下降。有趣的是,人类成人也存在类似偏差,表明语言模型可能从训练数据中继承了深层次的推理启发式方法。为此,我们量化分析了语言模型与人类的相似性,发现其推理模式与成人相似(而非儿童)。最后,我们提出一种测试时采样方法,显式地从语言模型中采样并排除因果关系的假设。这种可扩展的方法显著减少了析取偏差,使语言模型更接近科学、因果严谨的推理目标。
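
The hypothesis elimination idea in this abstract can be illustrated with a small Blicket-style example. This is a sketch under assumptions rather than the authors' method: in the paper the hypotheses are sampled from the LM, whereas here they are enumerated exhaustively, and the rule definitions (disjunctive means any blicket activates the machine, conjunctive means all blickets must be present together) are the usual textbook reading of the paradigm. The names `predicts_on` and `consistent` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Hypothesis = Tuple[FrozenSet[str], str]   # (set of blickets, rule)
Observation = Tuple[FrozenSet[str], bool]  # (objects placed, machine on?)


def predicts_on(hypothesis: Hypothesis, placed: FrozenSet[str]) -> bool:
    """Would the machine switch on under this hypothesis?"""
    blickets, rule = hypothesis
    present = blickets & placed
    if rule == "disjunctive":
        return len(present) >= 1   # any single blicket is enough
    return present == blickets     # conjunctive: all blickets are required


def consistent(
    hypotheses: List[Hypothesis], observations: List[Observation]
) -> List[Hypothesis]:
    """Keep only the hypotheses that agree with every observation."""
    return [
        h
        for h in hypotheses
        if all(predicts_on(h, placed) == on for placed, on in observations)
    ]


objects = ["A", "B", "C"]
hypotheses: List[Hypothesis] = []
for k in range(1, len(objects) + 1):
    for subset in combinations(objects, k):
        hypotheses.append((frozenset(subset), "disjunctive"))
        if k >= 2:  # a conjunctive rule needs at least two blickets
            hypotheses.append((frozenset(subset), "conjunctive"))

observations: List[Observation] = [
    (frozenset("A"), False),
    (frozenset("B"), False),
    (frozenset("AB"), True),
    (frozenset("C"), False),
]
survivors = consistent(hypotheses, observations)
print(survivors)
```

Here no single object activates the machine, so every disjunctive hypothesis is ruled out and only the conjunctive rule over A and B survives. A reasoner with a disjunctive bias would tend to keep looking for a single cause despite this evidence.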


In-Context Learning for Label-Efficient Cancer Image Classification in Oncology

Abstract

arXiv:2505.08798v1 Announce Type: cross Abstract: The application of AI in oncology has been limited by its reliance on large, annotated datasets and the need for retraining models for domain-specific diagnostic tasks. Taking heed of these limitations, we investigated in-context learning as a pragmatic alternative to model retraining by allowing models to adapt to new diagnostic tasks using only a few labeled examples at inference, without the need for retraining. Using four vision-language models (VLMs)-Paligemma, CLIP, ALIGN and GPT-4o, we evaluated the performance across three oncology datasets: MHIST, PatchCamelyon and HAM10000. To the best of our knowledge, this is the first study to compare the performance of multiple VLMs on different oncology classification tasks. Without any parameter updates, all models showed significant gains with few-shot prompting, with GPT-4o reaching an F1 score of 0.81 in binary classification and 0.60 in multi-class classification settings. While these results remain below the ceiling of fully fine-tuned systems, they highlight the potential of ICL to approximate task-specific behavior using only a handful of examples, reflecting how clinicians often reason from prior cases. Notably, open-source models like Paligemma and CLIP demonstrated competitive gains despite their smaller size, suggesting feasibility for deployment in computing constrained clinical environments. Overall, these findings highlight the potential of ICL as a practical solution in oncology, particularly for rare cancers and resource-limited contexts where fine-tuning is infeasible and annotated data is difficult to obtain.

摘要

人工智能在肿瘤学中的应用一直受到两大限制:依赖大规模标注数据集以及需要针对特定领域诊断任务重新训练模型。为解决这些问题,我们研究了上下文学习作为一种实用替代方案——该方法仅需在推理时提供少量标注样本即可使模型适应新诊断任务,无需重新训练。我们使用四种视觉语言模型(VLM,包括Paligemma、CLIP、ALIGN和GPT-4o),在三个肿瘤学数据集(MHIST、PatchCamelyon和HAM10000)上评估了性能表现。据我们所知,这是首个比较多种VLM在不同肿瘤分类任务中性能的研究。在不更新任何参数的情况下,所有模型通过少量样本提示均获得显著性能提升,其中GPT-4o在二分类和多分类任务中分别达到0.81和0.60的F1分数。虽然这些结果仍低于全参数微调系统的上限,但证明了上下文学习仅用少量样本即可近似实现任务特定行为的潜力,这类似于临床医生基于既往病例的推理模式。值得注意的是,尽管规模较小,Paligemma和CLIP等开源模型仍展现出具有竞争力的性能提升,表明其在计算资源受限的临床环境中具备部署可行性。总体而言,这些发现凸显了上下文学习在肿瘤学中的实用价值,尤其适用于罕见癌症和资源受限场景——这些情况下模型微调难以实施且标注数据获取困难。


Self Rewarding Self Improving

Abstract

arXiv:2505.08827v1 Announce Type: cross Abstract: We demonstrate that large language models can effectively self-improve through self-judging without requiring reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments on Countdown puzzles and MIT Integration Bee problems show that models can provide reliable reward signals without ground truth answers, enabling reinforcement learning in domains previously not possible. By implementing self-judging, we achieve significant performance gains maintaining alignment with formal verification. When combined with synthetic question generation, we establish a complete self-improvement loop where models generate practice problems, solve them, and evaluate their own performance-achieving an 8% improvement with Qwen 2.5 7B over baseline and surpassing GPT-4o performance on integration tasks. Our findings demonstrate that LLM judges can provide effective reward signals for training models, unlocking many reinforcement learning environments previously limited by the difficulty of creating programmatic rewards. This suggests a potential paradigm shift toward AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress in domains with scarce training data or complex evaluation requirements.

摘要

我们证明大型语言模型能够通过自我评判有效实现自我改进,而无需依赖参考答案,这种能力源于生成与验证解决方案之间固有的不对称性。在倒计时谜题和MIT积分竞赛题的实验中,模型无需标准答案即可提供可靠的奖励信号,从而在以往难以实现的领域实现强化学习。通过实施自我评判机制,我们在保持形式验证一致性的同时获得了显著的性能提升。当结合合成问题生成技术时,我们构建了一个完整的自我改进闭环系统:模型自主生成练习题、解决问题并评估自身表现——Qwen 2.5 7B模型较基线提升8%,在积分任务上超越GPT-4o的表现。研究结果表明,LLM评判器能为模型训练提供有效的奖励信号,解决了以往因程序化奖励设计困难而受限的诸多强化学习场景。这意味着人工智能系统可能转向通过自主学习而非人工指导训练实现持续改进的新范式,在训练数据稀缺或评估要求复杂的领域有望加速进展。


CellTypeAgent: Trustworthy cell type annotation with Large Language Models

Abstract

arXiv:2505.08844v1 Announce Type: cross Abstract: Cell type annotation is a critical yet laborious step in single-cell RNA sequencing analysis. We present a trustworthy large language model (LLM)-agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell types from 36 tissues. This combined approach holds promise for more efficient and reliable cell type annotation.

摘要

细胞类型注释是单细胞RNA测序分析中关键但繁琐的步骤。我们提出了一种可信赖的大型语言模型(LLM)代理——CellTypeAgent,该模型将LLM与相关数据库的验证相结合。CellTypeAgent在降低幻觉风险的同时,实现了比现有方法更高的准确率。我们在涉及36个组织303种细胞类型的9个真实数据集上对CellTypeAgent进行了评估。这种组合方法有望实现更高效可靠的细胞类型注释。


An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Abstract

arXiv:2505.08823v1 Announce Type: cross Abstract: Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.

摘要

大型语言模型(LLMs)革新了自然语言处理领域,但其规模导致实际部署成本高昂。训练后量化技术虽能降低内存与计算开销,却常伴随精度损失;而量化感知训练虽可恢复性能,但需额外训练代价。将量化推至三元(2位)体系可带来更大节省,但该过程出了名地不稳定。近期研究表明,采用无偏置、RMS归一化的Transformer配合直通估计法可达1.58位精度;在此基础上,我们证明:仅需在每个线性投影前插入RMS归一化,并应用渐进式分层量化方案,即可稳定地将全精度检查点微调为三元LLMs。该方法在标准语言建模基准测试中匹配或超越了更复杂的知识蒸馏流程,且未增加模型复杂度。这些结果表明:仅通过精细的归一化处理就能大幅缩小三元与全精度LLMs之间的精度差距,使得超低位推理具备实际应用价值。
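
The sketch below shows what placing an RMS normalization in front of a ternary linear projection can look like, and why ternary weights are described as 1.58 bits (log2 of 3). The abstract does not spell out the quantizer, so the absmean scaling used here is an assumption borrowed from BitNet b1.58-style models. The learnable gain of RMSNorm, the straight-through estimator, and the gradual layer-wise schedule are all omitted, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x so its root-mean-square is about 1 (learnable gain omitted)."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


def ternary_quantize(
    weights: List[float], eps: float = 1e-8
) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, +1} codes plus one shared scale.

    Assumption: absmean scaling, i.e. divide by the mean absolute weight,
    round to the nearest integer, then clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def ternary_linear(x: List[float], weight_rows: List[List[float]]) -> List[float]:
    """Normalize the input, then apply a linear layer with ternary weights."""
    h = rms_norm(x)
    flat = [w for row in weight_rows for w in row]
    codes, scale = ternary_quantize(flat)  # one scale for the whole matrix
    width = len(weight_rows[0])
    out = []
    for r in range(len(weight_rows)):
        row_codes = codes[r * width:(r + 1) * width]
        out.append(scale * sum(c * v for c, v in zip(row_codes, h)))
    return out


codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
bits_per_weight = math.log2(3)  # three states per weight
y = ternary_linear([3.0, 4.0], [[0.9, -0.05], [0.4, -1.2]])
print(codes, round(scale, 4), round(bits_per_weight, 3))
```

During fine-tuning, the full-precision weights would be kept as latent parameters and the rounding step would be bypassed in the backward pass, which is the role of the straight-through estimator mentioned in the abstract.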


Human-AI Collaboration or Academic Misconduct? Measuring AI Use in Student Writing Through Stylometric Evidence

Abstract

arXiv:2505.08828v1 Announce Type: cross Abstract: As human-AI collaboration becomes increasingly prevalent in educational contexts, understanding and measuring the extent and nature of such interactions pose significant challenges. This research investigates the use of authorship verification (AV) techniques not as a punitive measure, but as a means to quantify AI assistance in academic writing, with a focus on promoting transparency, interpretability, and student development. Building on prior work, we structured our investigation into three stages: dataset selection and expansion, AV method development, and systematic evaluation. Using three datasets - including a public dataset (PAN-14) and two from University of Melbourne students from various courses - we expanded the data to include LLM-generated texts, totalling 1,889 documents and 540 authorship problems from 506 students. We developed an adapted Feature Vector Difference AV methodology to construct robust academic writing profiles for students, designed to capture meaningful, individual characteristics of their writing. The method's effectiveness was evaluated across multiple scenarios, including distinguishing between student-authored and LLM-generated texts and testing resilience against LLMs' attempts to mimic student writing styles. Results demonstrate the enhanced AV classifier's ability to identify stylometric discrepancies and measure human-AI collaboration at word and sentence levels while providing educators with a transparent tool to support academic integrity investigations. This work advances AV technology, offering actionable insights into the dynamics of academic writing in an AI-driven era.

摘要

随着人机协作在教育场景中的日益普及,如何量化和理解此类互动的程度与性质成为重要挑战。本研究探讨了将作者身份验证(AV)技术并非作为惩罚手段,而是作为量化学术写作中AI辅助程度的工具,重点关注提升透明度、可解释性及促进学生发展。基于前人研究,我们将调查分为三个阶段:数据集选择与扩展、AV方法开发、系统评估。通过使用三个数据集(包括公开数据集PAN-14和墨尔本大学多门课程的学生文本),我们将数据扩展至包含大语言模型生成文本,最终形成来自506名学生的1,889份文档和540个作者归属问题。我们开发了改进的特征向量差分AV方法,用于构建学生稳健的学术写作特征画像,旨在捕捉其写作中有意义的个体特征。该方法在多种场景下进行评估,包括区分学生原创文本与LLM生成文本,以及测试其对LLM模仿学生写作风格的抗干扰能力。结果表明,增强型AV分类器能够识别文体特征差异,在词汇和句子层面量化人机协作程度,同时为教育工作者提供支持学术诚信调查的透明工具。本研究推动了AV技术的发展,为AI驱动时代的学术写作动态提供了可操作的见解。


Federated Large Language Models: Feasibility, Robustness, Security and Future Directions

Abstract

arXiv:2505.08830v1 Announce Type: cross Abstract: The integration of Large Language Models (LLMs) and Federated Learning (FL) presents a promising solution for joint training on distributed data while preserving privacy and addressing data silo issues. However, this emerging field, known as Federated Large Language Models (FLLM), faces significant challenges, including communication and computation overheads, heterogeneity, privacy and security concerns. Current research has primarily focused on the feasibility of FLLM, but future trends are expected to emphasize enhancing system robustness and security. This paper provides a comprehensive review of the latest advancements in FLLM, examining challenges from four critical perspectives: feasibility, robustness, security, and future directions. We present an exhaustive survey of existing studies on FLLM feasibility, introduce methods to enhance robustness in the face of resource, data, and task heterogeneity, and analyze novel risks associated with this integration, including privacy threats and security challenges. We also review the latest developments in defense mechanisms and explore promising future research directions, such as few-shot learning, machine unlearning, and IP protection. This survey highlights the pressing need for further research to enhance system robustness and security while addressing the unique challenges posed by the integration of FL and LLM.

摘要

大型语言模型(LLMs)与联邦学习(FL)的集成为分布式数据联合训练提供了一种前景广阔的解决方案,既能保护隐私,又能解决数据孤岛问题。然而这一新兴领域——联邦大型语言模型(FLLM)仍面临通信计算开销、异构性、隐私安全等重大挑战。当前研究主要聚焦FLLM的可行性,未来趋势预计将着力提升系统鲁棒性与安全性。本文全面综述了FLLM领域的最新进展,从可行性、鲁棒性、安全性和未来方向四个关键维度剖析挑战:系统梳理了现有关于FLLM可行性的研究,介绍了应对资源、数据及任务异构性的鲁棒性增强方法,分析了该集成带来的隐私威胁与安全挑战等新型风险,综述了防御机制的最新进展,并探讨了少样本学习、机器遗忘、知识产权保护等未来研究方向。本综述强调亟需进一步研究以提升系统鲁棒性与安全性,同时应对FL与LLM融合带来的独特挑战。


Performance Gains of LLMs With Humans in a World of LLMs Versus Humans

Abstract

arXiv:2505.08902v1 Announce Type: cross Abstract: Currently, a considerable research effort is devoted to comparing LLMs to a group of human experts, where the term "expert" is often ill-defined or variable, at best, in a state of constantly updating LLM releases. Without proper safeguards in place, LLMs will threaten to cause harm to the established structure of safe delivery of patient care which has been carefully developed throughout history to keep the safety of the patient at the forefront. A key driver of LLM innovation is founded on community research efforts which, if continuing to operate under "humans versus LLMs" principles, will expedite this trend. Therefore, research efforts moving forward must focus on effectively characterizing the safe use of LLMs in clinical settings that persist across the rapid development of novel LLM models. In this communication, we demonstrate that rather than comparing LLMs to humans, there is a need to develop strategies enabling efficient work of humans with LLMs in an almost symbiotic manner.

摘要

当前,大量研究致力于将大型语言模型(LLMs)与人类专家群体进行比较,而所谓'专家'的定义往往模糊不清或变动不居——在持续更新的LLM版本迭代中尤其如此。若缺乏适当的防护机制,LLMs可能危及历经历史沉淀形成的患者安全护理体系,这一体系始终以患者安全为核心原则。LLM创新的关键驱动力源于社区研究努力,但若继续遵循'人类对抗LLMs'的研究范式,将加速这一风险趋势。因此,未来研究必须聚焦于建立可持续适应LLM快速迭代的临床安全使用框架。本文论证指出:相较于将LLMs与人类进行对比,更需开发使人类与LLMs近乎共生协作的高效工作策略。


Optimized Couplings for Watermarking Large Language Models

Abstract

arXiv:2505.08878v1 Announce Type: cross Abstract: Large-language models (LLMs) are now able to produce text that is, in many cases, seemingly indistinguishable from human-generated content. This has fueled the development of watermarks that imprint a "signal" in LLM-generated text with minimal perturbation of an LLM's output. This paper provides an analysis of text watermarking in a one-shot setting. Through the lens of hypothesis testing with side information, we formulate and analyze the fundamental trade-off between watermark detection power and distortion in generated textual quality. We argue that a key component in watermark design is generating a coupling between the side information shared with the watermark detector and a random partition of the LLM vocabulary. Our analysis identifies the optimal coupling and randomization strategy under the worst-case LLM next-token distribution that satisfies a min-entropy constraint. We provide a closed-form expression of the resulting detection rate under the proposed scheme and quantify the cost in a max-min sense. Finally, we provide an array of numerical results, comparing the proposed scheme with the theoretical optimum and existing schemes, in both synthetic data and LLM watermarking. Our code is available at https://github.com/Carol-Long/CC_Watermark

摘要

当前,大语言模型(LLMs)生成的文本在许多情况下已近乎与人类创作内容难以区分。这推动了水印技术的发展,旨在以最小化模型输出扰动的方式为LLM生成文本植入"信号"。本文针对单次文本水印场景展开分析。通过带辅助信息的假设检验视角,我们系统阐述并论证了水印检测效能与生成文本质量失真之间的基础权衡关系。研究指出,水印设计的核心在于构建检测器共享辅助信息与LLM词汇表随机划分之间的耦合机制。在满足最小熵约束的最坏情况LLM下一词元分布条件下,我们的分析确定了最优耦合与随机化策略,并给出所提方案检测率的闭式表达式,以最大最小准则量化了实施成本。最后,通过合成数据与LLM水印实验的数值结果对比,将所提方案与理论最优及现有方案进行系统比较。代码已开源:https://github.com/Carol-Long/CC_Watermark


Improved Algorithms for Differentially Private Language Model Alignment

Abstract

arXiv:2505.08849v1 Announce Type: cross Abstract: Language model alignment is crucial for ensuring that large language models (LLMs) align with human preferences, yet it often involves sensitive user data, raising significant privacy concerns. While prior work has integrated differential privacy (DP) with alignment techniques, their performance remains limited. In this paper, we propose novel algorithms for privacy-preserving alignment and rigorously analyze their effectiveness across varying privacy budgets and models. Our framework can be deployed on two celebrated alignment techniques, namely direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF). Through systematic experiments on large-scale language models, we demonstrate that our approach achieves state-of-the-art performance. Notably, one of our algorithms, DP-AdamW, combined with DPO, surpasses existing methods, improving alignment quality by up to 15% under moderate privacy budgets (ε = 2-5). We further investigate the interplay between privacy guarantees, alignment efficacy, and computational demands, providing practical guidelines for optimizing these trade-offs.

摘要

语言模型对齐对于确保大语言模型(LLM)符合人类偏好至关重要,但该过程常涉及敏感用户数据,引发重大隐私隐忧。尽管现有研究已将差分隐私(DP)与对齐技术结合,其性能仍存在局限。本文提出新型隐私保护对齐算法,并严格分析其在不同隐私预算和模型下的有效性。我们的框架可部署于两种主流对齐技术——直接偏好优化(DPO)和基于人类反馈的强化学习(RLHF)。通过在大规模语言模型上的系统实验,我们证明该方法实现了最先进的性能。值得注意的是,我们的DP-AdamW算法与DPO结合时,在中等隐私预算(ε = 2-5)下将对齐质量提升达15%,超越了现有方法。我们进一步探究了隐私保证、对齐效能与计算需求之间的相互作用,为优化这些权衡提供了实用指导原则。
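
The abstract does not describe how DP-AdamW works internally. As a rough illustration of the generic differentially private gradient step that DP optimizers commonly build on (per-example clipping followed by Gaussian noise, as in DP-SGD), a minimal NumPy sketch might look like the following. The function name `dp_noisy_mean_gradient` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, not the paper's algorithm or API.

```python
import numpy as np

def dp_noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Generic DP-SGD-style step (assumed, not the paper's DP-AdamW):
    clip each example's gradient to L2 norm <= clip_norm, sum the clipped
    gradients, add Gaussian noise, and average over the batch."""
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale is 1 for gradients already within the bound, clip_norm / norm otherwise.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = [[3.0, 4.0], [0.0, 1.0]]
    print(dp_noisy_mean_gradient(batch, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The resulting noisy gradient would then be handed to an ordinary optimizer update. How the noise scale maps to a concrete ε depends on a privacy accountant, which this sketch omits.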


WaLLM -- Insights from an LLM-Powered Chatbot deployment via WhatsApp

Abstract

arXiv:2505.08894v1 Announce Type: cross Abstract: Recent advances in generative AI, such as ChatGPT, have transformed access to information in education, knowledge-seeking, and everyday decision-making. However, in many developing regions, access remains a challenge due to the persistent digital divide. To help bridge this gap, we developed WaLLM, a custom AI chatbot over WhatsApp, a widely used communication platform in developing regions. Beyond answering queries, WaLLM offers several features to enhance user engagement: a daily top question, suggested follow-up questions, trending and recent queries, and a leaderboard-based reward system. Our service has been operational for over 6 months, amassing over 14.7K queries from approximately 100 users. In this paper, we present WaLLM's design and a systematic analysis of logs to understand user interactions. Our results show that 55% of user queries seek factual information. "Health and well-being" was the most popular topic (28%), including queries about nutrition and disease, suggesting users view WaLLM as a reliable source. Two-thirds of users' activity occurred within 24 hours of the daily top question. Users who accessed the "Leaderboard" interacted with WaLLM 3x as much as those who did not. We conclude by discussing implications for culture-based customization, user interface design, and appropriate calibration of users' trust in AI systems for developing regions.

摘要

生成式人工智能(如ChatGPT)的最新进展,彻底改变了教育、知识获取和日常决策中的信息访问方式。然而,在许多发展中地区,由于持续存在的数字鸿沟,获取这些技术仍面临挑战。为帮助弥合这一差距,我们开发了WaLLM——一个基于WhatsApp(发展中地区广泛使用的通信平台)的定制AI聊天机器人。除回答问题外,WaLLM还提供多项增强用户参与度的功能:每日精选问题、后续问题建议、热门及近期查询,以及基于排行榜的奖励系统。该服务已运行超过6个月,累计接收来自约100名用户的14,700余次查询。本文介绍了WaLLM的设计方案,并通过系统日志分析理解用户交互行为。结果显示:55%的用户查询旨在获取事实信息;“健康与福祉”是最热门主题(占28%),包括营养与疾病相关咨询,表明用户将WaLLM视为可靠信息来源;三分之二的用户活动集中在每日精选问题发布后的24小时内;访问“排行榜”功能的用户互动量是未访问者的3倍。最后,我们讨论了针对发展中地区的文化定制、用户界面设计以及用户对AI系统信任度的合理校准等启示。


Generative AI for Autonomous Driving: Frontiers and Opportunities

Abstract

arXiv:2505.08854v1 Announce Type: cross Abstract: Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.

摘要

生成式人工智能(GenAI)作为一股变革性技术浪潮,正通过其无与伦比的内容创造、推理规划与多模态理解能力重塑各行业。这一革命性力量为解决工程学最重大挑战——实现可靠的全自动驾驶(尤其是L5级自动驾驶)提供了最具前景的路径。本综述对GenAI在自动驾驶技术栈中的新兴作用进行了全面而批判性的梳理:首先提炼现代生成模型(包括VAE、GAN、扩散模型与大语言模型)的原理与权衡;继而系统阐述其在图像/激光雷达/轨迹/占据栅格/视频生成以及LLM引导推理决策中的前沿应用;分类探讨合成数据工作流、端到端驾驶策略、高保真数字孪生系统、智能交通网络及向具身AI的跨领域迁移等实际应用场景。研究同时揭示了关键挑战与机遇,包括罕见场景的泛化能力、安全评估机制、有限预算实施、法规合规性、伦理问题及环境影响,并提出涵盖理论保证、信任度量、交通系统整合与社会技术影响的研究路线。通过整合这些脉络,本综述为探索生成式AI与先进自动驾驶融合的研究者、工程师及政策制定者提供了前瞻性参考。相关文献的动态维护仓库详见https://github.com/taco-group/GenAI4AD。


AI-Mediated Code Comment Improvement

Abstract

arXiv:2505.09021v1 Announce Type: cross Abstract: This paper describes an approach to improve code comments along different quality axes by rewriting those comments with customized Artificial Intelligence (AI)-based tools. We conduct an empirical study followed by grounded theory qualitative analysis to determine the quality axes to improve. Then we propose a procedure using a Large Language Model (LLM) to rewrite existing code comments along the quality axes. We implement our procedure using GPT-4o, then distil the results into a smaller model capable of being run in-house, so users can maintain data custody. We evaluate both our approach using GPT-4o and the distilled model versions. We show in an evaluation how our procedure improves code comments along the quality axes. We release all data and source code in an online repository for reproducibility.

摘要

本文提出一种通过定制化人工智能工具重写代码注释以提升多维度质量的方法。我们首先通过实证研究和扎根理论定性分析确定需要改进的质量维度,随后提出利用大型语言模型(LLM)沿这些质量维度重写现有代码注释的流程。该流程采用GPT-4o实现,并将结果蒸馏为可在本地运行的小型模型,以保障用户数据主权。我们分别评估了基于GPT-4o的方案和蒸馏模型版本,并通过实验验证了本方法在提升代码注释多维度质量方面的有效性。所有数据和源代码已发布于在线存储库以确保可复现性。


Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation

Abstract

arXiv:2505.09027v1 Announce Type: cross Abstract: We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks, where test cases serve as both prompt and verification for code generation. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases, reflecting real-world software development practices. Comprising 1000 diverse challenges across 20 application domains, the benchmark evaluates LLMs on their ability to generate compact, functional code under the constraints of context length and multi-feature complexity. Our findings highlight instruction following and in-context learning as critical capabilities for TDD success, surpassing the importance of general coding proficiency or pretraining knowledge. Through comprehensive evaluation of 19 frontier models, we reveal performance bottlenecks, such as instruction loss in long prompts, and provide a detailed error analysis spanning multiple root causes. This work underscores the practical value of TDD-specific benchmarks and lays the foundation for advancing LLM capabilities in rigorous, application-driven coding scenarios.

摘要

我们推出WebApp1K这一创新基准,用于评估大语言模型(LLM)在测试驱动开发(TDD)任务中的表现。该基准以测试用例作为代码生成的提示和验证手段,区别于依赖自然语言提示的传统方法,强调LLM直接从测试用例解析并实现功能的能力,从而反映真实软件开发实践。该基准包含20个应用领域的1000项多样化挑战,通过上下文长度限制和多特征复杂度的约束条件,评估LLM生成简洁功能性代码的能力。研究发现,遵循指令和上下文学习能力是TDD成功的关键要素,其重要性超过通用编码能力或预训练知识。通过对19个前沿模型的全面评估,我们揭示了性能瓶颈(如长提示中的指令丢失现象),并提供了涵盖多重根源的详细错误分析。本研究强调了TDD专项基准的实用价值,为提升LLM在严格的应用驱动编码场景中的能力奠定了基础。
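
To make the "tests as prompt" idea concrete, here is a minimal sketch in which the same test source serves first as the specification handed to a model and then as the check on the code it returns. Everything in it is hypothetical: the `slugify` task, the `TEST_PROMPT` and `CANDIDATE` strings, and the `run_tests_against` helper are illustrative and are not taken from WebApp1K.

```python
def run_tests_against(candidate_source: str, test_source: str) -> bool:
    """Execute a candidate implementation, then the tests, in one namespace.
    Returns True only if the candidate loads and every assertion passes.
    (Illustrative only: real harnesses sandbox untrusted generated code.)"""
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
        return True
    except Exception:
        return False

# The tests double as the prompt: a model would be given this text and asked
# to write code that makes it pass.
TEST_PROMPT = """
assert slugify("Hello World") == "hello-world"
assert slugify("  A  B ") == "a-b"
"""

# Stand-in for model output (hypothetical).
CANDIDATE = """
def slugify(text):
    return "-".join(text.lower().split())
"""

if __name__ == "__main__":
    print(run_tests_against(CANDIDATE, TEST_PROMPT))
```

Because the verification step is just test execution, pass/fail needs no human grading, which is what lets a benchmark of this kind scale to many challenges.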


CEC-Zero: Chinese Error Correction Solution Based on LLM

Abstract

arXiv:2505.09082v1 Announce Type: cross Abstract: Recent advancements in large language models (LLMs) demonstrate exceptional Chinese text processing capabilities, particularly in Chinese Spelling Correction (CSC). While LLMs outperform traditional BERT-based models in accuracy and robustness, challenges persist in reliability and generalization. This paper proposes CEC-Zero, a novel reinforcement learning (RL) framework enabling LLMs to self-correct through autonomous error strategy learning without external supervision. By integrating RL with LLMs' generative power, the method eliminates dependency on annotated data or auxiliary models. Experiments reveal RL-enhanced LLMs achieve industry-viable accuracy and superior cross-domain generalization, offering a scalable solution for reliability optimization in Chinese NLP applications. This breakthrough facilitates LLM deployment in practical Chinese text correction scenarios while establishing a new paradigm for self-improving language models.

摘要

大语言模型(LLMs)的最新进展展现出卓越的中文文本处理能力,尤其在中文拼写纠错(CSC)任务中表现突出。尽管LLMs在准确性和鲁棒性上超越了传统的基于BERT的模型,但其可靠性与泛化能力仍存在挑战。本文提出CEC-Zero——一种新型强化学习(RL)框架,使LLMs能够通过自主错误策略学习实现自我纠错,无需外部监督。该方法将强化学习与LLMs的生成能力相结合,消除了对标注数据或辅助模型的依赖。实验表明,经RL增强的LLMs达到了工业级可用精度,并具备优异的跨领域泛化性能,为中文自然语言处理应用的可靠性优化提供了可扩展的解决方案。这一突破不仅推动了LLMs在实际中文文本纠错场景中的部署,同时为自改进语言模型建立了新范式。


Variational Prefix Tuning for Diverse and Accurate Code Summarization Using Pre-trained Language Models

Abstract

arXiv:2505.09062v1 Announce Type: cross Abstract: Recent advancements in source code summarization have leveraged transformer-based pre-trained models, including Large Language Models of Code (LLMCs), to automate and improve the generation of code summaries. However, existing methods often focus on generating a single high-quality summary for a given source code, neglecting scenarios where the generated summary might be inadequate and alternative options are needed. In this paper, we introduce Variational Prefix Tuning (VPT), a novel approach that enhances pre-trained models' ability to generate diverse yet accurate sets of summaries, allowing the user to choose the most suitable one for the given source code. Our method integrates a Conditional Variational Autoencoder (CVAE) framework as a modular component into pre-trained models, enabling us to model the distribution of observed target summaries and sample continuous embeddings to be used as prefixes to steer the generation of diverse outputs during decoding. Importantly, we construct our method in a parameter-efficient manner, eliminating the need for expensive model retraining, especially when using LLMCs. Furthermore, we employ a bi-criteria reranking method to select a subset of generated summaries, optimizing both the diversity and the accuracy of the options presented to users. We present extensive experimental evaluations using widely used datasets and current state-of-the-art pre-trained code summarization models to demonstrate the effectiveness of our approach and its adaptability across models.

摘要

源代码摘要生成领域的最新进展利用了基于Transformer的预训练模型(包括代码大语言模型LLMCs)来自动化并提升代码摘要的生成质量。然而,现有方法通常专注于为给定源代码生成单一高质量摘要,忽视了生成摘要可能不充分且需要替代方案的应用场景。本文提出变分前缀调优(VPT),该方法通过增强预训练模型生成多样化且准确摘要集合的能力,使用户能为给定代码选择最合适的摘要。我们的方法将条件变分自编码器(CVAE)框架作为模块化组件集成到预训练模型中,从而建模观测目标摘要的分布并采样连续嵌入作为前缀,在解码阶段引导多样化输出的生成。值得注意的是,本方法采用参数高效的设计方案,避免了昂贵的模型重训练成本,尤其在使用LLMCs时优势显著。此外,我们采用双标准重排序方法从生成摘要中筛选子集,在保证多样性的同时优化呈现给用户的选项准确性。通过广泛使用主流数据集和当前最先进的预训练代码摘要模型进行实验评估,我们验证了本方法的有效性及其跨模型的适应性。
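
The abstract does not specify the bi-criteria reranking procedure. As one plausible illustration of selecting a subset that balances accuracy against diversity, the sketch below uses a greedy, maximal-marginal-relevance-style rule. The accuracy `scores`, the Jaccard token-overlap similarity, and the `trade_off` weight are all assumptions made for the example, not the authors' method.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two summaries."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rerank(candidates, scores, k, trade_off=0.5):
    """Greedily pick k summaries, rewarding a high accuracy score and
    penalizing overlap with summaries that were already selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def gain(i):
            redundancy = max(
                (jaccard(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * scores[i] - (1 - trade_off) * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

if __name__ == "__main__":
    summaries = ["sort a list", "sort a list quickly", "read a file"]
    print(rerank(summaries, scores=[0.9, 0.85, 0.6], k=2))
```

With `trade_off=0.5`, the near-duplicate second summary loses to the lower-scored but distinct third one, which is the behavior a diversity-aware reranker is meant to produce.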


Human-like Cognitive Generalization for Large Models via Brain-in-the-loop Supervision

Abstract

arXiv:2505.09085v1 Announce Type: cross Abstract: Recent advancements in deep neural networks (DNNs), particularly large-scale language models, have demonstrated remarkable capabilities in image and natural language understanding. Although scaling up model parameters with increasing volume of training data has progressively improved DNN capabilities, achieving complex cognitive abilities - such as understanding abstract concepts, reasoning, and adapting to novel scenarios, which are intrinsic to human cognition - remains a major challenge. In this study, we show that brain-in-the-loop supervised learning, utilizing a small set of brain signals, can effectively transfer human conceptual structures to DNNs, significantly enhancing their comprehension of abstract and even unseen concepts. Experimental results further indicate that the enhanced cognitive capabilities lead to substantial performance gains in challenging tasks, including few-shot/zero-shot learning and out-of-distribution recognition, while also yielding highly interpretable concept representations. These findings highlight that human-in-the-loop supervision can effectively augment the complex cognitive abilities of large models, offering a promising pathway toward developing more human-like cognitive abilities in artificial systems.

摘要

深度神经网络(DNNs),尤其是大规模语言模型的最新进展,在图像和自然语言理解方面展现出卓越的能力。尽管通过增加训练数据规模来扩大模型参数已逐步提升DNN的能力,但实现复杂认知能力——如理解抽象概念、推理和适应新场景等人类认知固有的能力——仍是一项重大挑战。本研究表明,利用少量脑信号的脑环路监督学习,可有效将人类概念结构迁移至DNNs,显著增强其对抽象甚至未见概念的理解能力。实验结果进一步表明,增强的认知能力在少样本/零样本学习和分布外识别等挑战性任务中带来显著性能提升,同时生成高度可解释的概念表征。这些发现凸显了人在环路的监督能有效增强大模型的复杂认知能力,为人工系统开发更类人的认知能力提供了可行路径。


SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation

Abstract

arXiv:2505.09081v1 Announce Type: cross Abstract: Contemporary approaches to agent-based modeling (ABM) of social systems have traditionally emphasized rule-based behaviors, which limits their ability to capture nuanced dynamics that require moving beyond predefined rules and leveraging language models' (LMs') contextual understanding of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating LMs into social network simulation that achieves unprecedented temporal stability in multi-agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention-based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub-linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM-based framework capable of modeling long-term social phenomena while maintaining empirically validated behavioral fidelity.

摘要

当前基于代理的社会系统建模方法(ABM)传统上强调基于规则的行为,这限制了其通过超越预定义规则并利用语言模型(LMs)对人类社交互动的上下文理解来捕捉细微动态的能力。本文提出SALM(社交代理语言模型框架),这是一种将语言模型集成到社交网络模拟中的新方法,在多代理场景中实现了前所未有的时间稳定性。我们的主要贡献包括:(1)一种分层提示架构,能够在超过4,000个时间步长的情况下实现稳定模拟,同时将令牌使用量减少73%;(2)一种基于注意力的记忆系统,实现了80%的缓存命中率(95%置信区间[78%,82%]),内存增长仅为次线性的9.5%;(3)对人格稳定性的形式化边界。通过对SNAP自我网络的广泛验证,我们展示了首个基于大型语言模型的框架,能够在建模长期社会现象的同时保持经验验证的行为保真度。


Air-Ground Collaboration for Language-Specified Missions in Unknown Environments

Abstract

arXiv:2505.09108v1 Announce Type: cross Abstract: As autonomous robotic systems become increasingly mature, users will want to specify missions at the level of intent rather than in low-level detail. Language is an expressive and intuitive medium for such mission specification. However, realizing language-guided robotic teams requires overcoming significant technical hurdles. Interpreting and realizing language-specified missions requires advanced semantic reasoning. Successful heterogeneous robots must effectively coordinate actions and share information across varying viewpoints. Additionally, communication between robots is typically intermittent, necessitating robust strategies that leverage communication opportunities to maintain coordination and achieve mission objectives. In this work, we present a first-of-its-kind system where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) are able to collaboratively accomplish missions specified in natural language while reacting to changes in specification on the fly. We leverage a Large Language Model (LLM)-enabled planner to reason over semantic-metric maps that are built online and opportunistically shared between an aerial and a ground robot. We consider task-driven navigation in urban and rural areas. Our system must infer mission-relevant semantics and actively acquire information via semantic mapping. In both ground and air-ground teaming experiments, we demonstrate our system on seven different natural-language specifications at up to kilometer-scale navigation.

摘要

随着自主机器人系统日趋成熟,用户将倾向于在意图层面而非底层细节上指定任务。语言作为任务规约媒介具有表达直观且高效的特性。然而实现语言引导的机器人团队协作仍需克服重大技术障碍:语言指定任务的解释与执行需要高级语义推理能力;成功的异构机器人系统必须在多视角下实现高效动作协调与信息共享;此外机器人间通信通常具有间歇性,需建立稳健策略以利用通信机会维持协作并达成任务目标。本研究提出了一种创新系统,使无人机(UAV)与无人地面车辆(UGV)能够协作完成自然语言指定的任务,并实时响应任务变更。该系统采用基于大语言模型(LLM)的规划器,对空中与地面机器人实时构建并机会性共享的语义-度量地图进行推理。我们研究了城市与乡村环境中的任务驱动导航,系统需推断任务相关语义并通过语义建图主动获取信息。在单地面机器人及空地协同实验中,我们在七种不同的自然语言任务规约下实现了千米级导航,验证了系统效能。


Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases

Abstract

arXiv:2505.09246v1 Announce Type: cross Abstract: In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades including finetuning. The source code is publicly available at https://github.com/kramerlab/FocusedRetriever .

摘要

在许多现实场景中,机器学习模型和交互系统可以同时获取结构化知识(如知识图谱或表格)和非结构化内容(如自然语言文档)。然而大多数系统仅依赖其中一种。半结构化知识库(SKB)通过将非结构化内容与结构化数据节点相连接,弥合了这一鸿沟,从而实现了知识获取与使用的新策略。本研究提出FocusedRetriever,一个基于SKB的模块化多跳问答框架。该框架通过整合向量相似度搜索的实体检索、基于大语言模型的Cypher查询生成及成对重排序等组件,在STaRK基准测试的所有三个数据集上均超越了现有最优方法,涵盖不同领域和多项性能指标,平均首次命中率较次优方法高出25.7%。FocusedRetriever的创新性在于:(1)利用大语言模型从非结构化文本中提取关系事实和实体属性;(2)通过节点集连接基于提取的三元组和约束条件筛选候选答案;(3)采用向量相似度搜索检索并排序相关非结构化内容;(4)最终利用大语言模型的上下文理解能力对top-k答案进行排序。为保持通用性,评估中仅使用基础大语言模型,但对中间结果的分析揭示了包括微调在内的多项升级潜力。源代码已公开于https://github.com/kramerlab/FocusedRetriever。
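
Steps (2) and (3) of the pipeline described above, filtering candidates with node set joins and then ranking by vector similarity, can be sketched roughly as follows. This is a toy approximation under stated assumptions: the function `filter_and_rank`, the dictionary of node embeddings, and the use of plain set intersection plus cosine similarity are illustrative choices, not the FocusedRetriever implementation.

```python
import numpy as np

def filter_and_rank(constraint_node_sets, node_embeddings, query_embedding, k):
    """Intersect the node sets produced by each extracted constraint, then
    rank the surviving candidates by cosine similarity to the query."""
    if constraint_node_sets:
        candidates = set.intersection(*constraint_node_sets)
    else:
        candidates = set(node_embeddings)  # no constraints: keep every node
    q = np.asarray(query_embedding, dtype=float)

    def cosine(vec):
        v = np.asarray(vec, dtype=float)
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))

    ranked = sorted(candidates, key=lambda n: cosine(node_embeddings[n]), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
    node_sets = [{"a", "b", "c"}, {"b", "c", "d"}]
    print(filter_and_rank(node_sets, embeddings, [1.0, 0.0], k=2))
```

In the framework as described, the final top-k ordering is produced by an LLM-based re-ranking step, which this sketch leaves out.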


Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records

Abstract

arXiv:2505.09435v1 Announce Type: cross Abstract: Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP's three-stage framework--cleansing, attunement, and unification--addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.

摘要

基于图像-文本结肠镜检查记录的预训练为提升内窥镜图像分析提供了重要潜力,但面临非信息性背景图像、复杂医学术语及模糊多病灶描述等挑战。我们提出Endo-CLIP,一种新型自监督框架,通过增强对比语言-图像预训练(CLIP)技术应对该领域需求。该框架包含清洗、调谐和统一三阶段:(1) 剔除背景帧,(2) 利用大语言模型提取临床特征以实现细粒度对比学习,(3) 采用患者级交叉注意力机制解决多息肉歧义问题。大量实验表明,Endo-CLIP在零样本和少样本息肉检测与分类任务中显著优于当前最先进的预训练方法,为开发更精准且具临床意义的内窥镜分析奠定了基础。


Multilingual Machine Translation with Quantum Encoder Decoder Attention-based Convolutional Variational Circuits

Abstract

arXiv:2505.09407v1 Announce Type: cross Abstract: Cloud-based multilingual translation services like Google Translate and Microsoft Translator achieve state-of-the-art translation capabilities. These services inherently use large multilingual language models such as GRU, LSTM, BERT, GPT, T5, or similar encoder-decoder architectures with attention mechanisms as the backbone. Also, new age natural language systems, for instance ChatGPT and DeepSeek, have established huge potential in multiple tasks in natural language processing. At the same time, they also possess outstanding multilingual translation capabilities. However, these models use the classical computing realm as a backend. QEDACVC (Quantum Encoder Decoder Attention-based Convolutional Variational Circuits) is an alternate solution that explores the quantum computing realm instead of the classical computing realm to study and demonstrate multilingual machine translation. QEDACVC introduces the quantum encoder-decoder architecture that simulates and runs on quantum computing hardware via quantum convolution, quantum pooling, quantum variational circuit, and quantum attention as software alterations. QEDACVC achieves an Accuracy of 82% when trained on the OPUS dataset for English, French, German, and Hindi corpora for multilingual translations.

摘要

基于云服务的多语言翻译系统(如谷歌翻译和微软翻译)已实现最先进的翻译能力。这些服务本质上采用GRU、LSTM、BERT、GPT、T5等大型多语言模型或类似带有注意力机制的编码器-解码器架构作为核心。同时,新一代自然语言系统(如ChatGPT和DeepSeek)在自然语言处理的多种任务中展现出巨大潜力,并具备卓越的多语言翻译能力。然而,这些模型仍以经典计算领域为后端。QEDACVC(量子编码器-解码器注意力卷积变分电路)提出了一种替代方案,通过探索量子计算领域而非经典计算领域来研究和实现多语言机器翻译。QEDACVC引入了量子编码器-解码器架构,通过量子卷积、量子池化、量子变分电路和量子注意力等软件改造,实现在量子计算硬件上的模拟与运行。在OPUS数据集上针对英语、法语、德语和印地语语料进行多语言翻译训练时,QEDACVC达到了82%的准确率。


Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

Abstract

arXiv:2505.09438v1 Announce Type: cross Abstract: Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes characteristic strengths and limitations of LLM-generated solutions. The findings of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, while o1-preview almost consistently outperformed both GPT-4o and the human benchmark. Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education, including how to uphold assessment integrity and support students in critically engaging with LLMs.

摘要

大型语言模型(LLMs)现已广泛普及,覆盖了各个教育层次的学习者。这一发展引发了人们对其可能绕过关键学习过程、破坏现有评估形式完整性的担忧。在物理教育中,问题解决在教学中占据核心地位,因此理解LLMs在物理特定问题解决方面的能力至关重要。这种理解对于制定负责任且符合教学原则的LLMs整合方案具有关键意义。本研究基于一组定义明确的奥林匹克竞赛题目,比较了通用LLM(GPT-4o,采用不同提示技术)与推理优化模型(o1-preview)同德国物理奥林匹克竞赛参与者的解题表现。除评估生成答案的正确性外,研究还分析了LLM生成解决方案的典型优势与局限。研究结果表明,两种测试模型(GPT-4o和o1-preview)在奥林匹克类型物理问题上均展现出高阶解题能力,平均表现优于人类参与者。提示技术对GPT-4o性能影响甚微,而o1-preview几乎在所有情况下都优于GPT-4o和人类基准。基于这些发现,本研究探讨了其对物理教育总结性评估与形成性评估设计的启示,包括如何维护评估完整性以及支持学生批判性运用LLMs。


Deploying Foundation Model-Enabled Air and Ground Robots in the Field: Challenges and Opportunities

Abstract

arXiv:2505.09477v1 Announce Type: cross Abstract: The integration of foundation models (FMs) into robotics has enabled robots to understand natural language and reason about the semantics in their environments. However, existing FM-enabled robots primarily operate in closed-world settings, where the robot is given a full prior map or has a full view of its workspace. This paper addresses the deployment of FM-enabled robots in the field, where missions often require a robot to operate in large-scale and unstructured environments. To effectively accomplish these missions, robots must actively explore their environments, navigate obstacle-cluttered terrain, handle unexpected sensor inputs, and operate with compute constraints. We discuss recent deployments of SPINE, our LLM-enabled autonomy framework, in field robotic settings. To the best of our knowledge, we present the first demonstration of large-scale LLM-enabled robot planning in unstructured environments with several kilometers of missions. SPINE is agnostic to a particular LLM, which allows us to distill small language models capable of running onboard size, weight, and power (SWaP)-limited platforms. Via preliminary model distillation work, we then present the first language-driven UAV planner using on-device language models. We conclude our paper by proposing several promising directions for future research.

摘要

将基础模型(FMs)整合到机器人技术中,使机器人能够理解自然语言并推理其环境语义。然而,现有基于基础模型的机器人主要在封闭世界环境中运行,这些环境通常提供完整先验地图或工作空间全景。本文探讨了基础模型机器人在野外环境中的部署问题,此类任务常要求机器人在大规模非结构化环境中作业。为有效完成这些任务,机器人需主动探索环境、穿越障碍密集地形、处理意外传感器输入,并在计算资源受限条件下运行。我们介绍了近期在野外机器人场景中部署SPINE(基于大语言模型的自主框架)的实践。据我们所知,本研究首次展示了在数公里级非结构化环境中实现大语言模型驱动的机器人规划。SPINE不依赖特定大语言模型,这使得我们能够提炼出可在尺寸、重量和功耗(SWaP)受限平台上运行的小型语言模型。通过初步的模型提炼工作,我们进一步提出了首个采用设备端语言模型驱动的无人机规划器。最后,本文针对未来研究方向提出了若干具有前景的探索路径。


CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios

Abstract

arXiv:2505.09436v1 Announce Type: cross Abstract: Large Language Models (LLMs) hold immense potential for revolutionizing Customer Experience Management (CXM), particularly in contact center operations. However, evaluating their practical utility in complex operational environments is hindered by data scarcity (due to privacy concerns) and the limitations of current benchmarks. Existing benchmarks often lack realism, failing to incorporate deep knowledge base (KB) integration, real-world noise, or critical operational tasks beyond conversational fluency. To bridge this gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset specifically designed for evaluating AI in operational CXM contexts. Given the diversity in possible contact center features, we have developed a scalable LLM-powered pipeline that simulates the brand's CXM entities that form the foundation of our datasets, such as knowledge articles including product specifications, issue taxonomies, and contact center conversations. The entities closely represent real-world distribution because of controlled noise injection (informed by domain experts) and rigorous automated validation. Building on this, we release CXMArena, which provides dedicated benchmarks targeting five important operational tasks: Knowledge Base Refinement, Intent Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with Integrated Tools. Our baseline experiments underscore the benchmark's difficulty: even state-of-the-art embedding and generation models achieve only 68% accuracy on article search, while standard embedding methods yield a low F1 score of 0.3 for knowledge base refinement, highlighting significant challenges for current models that necessitate complex pipelines and solutions over conventional techniques.

摘要

大型语言模型(LLMs)在客户体验管理(CXM)领域,尤其是联络中心运营方面具有革命性潜力。然而,数据稀缺(源于隐私问题)与现有基准测试的局限性阻碍了对其在复杂运营环境中实用性的评估。当前基准测试往往缺乏真实性,未能整合深层知识库(KB)、真实场景噪声或除对话流畅性之外的关键运营任务。为弥补这一缺口,我们推出CXMArena——一个专为运营级CXM场景下AI评估设计的新型大规模合成基准数据集。鉴于联络中心功能的多样性,我们开发了基于LLM的可扩展流水线,用于模拟构成数据集基础的品牌CXM实体(如包含产品规格的知识文章、问题分类体系及联络中心对话记录)。通过受控噪声注入(经领域专家指导)与严格自动化验证,这些实体高度还原了真实世界的数据分布。在此基础上,我们发布的CXMArena提供了针对五大核心运营任务的专项基准:知识库优化、意图预测、坐席质量合规、文章检索以及集成工具的多轮RAG。基线实验凸显了该基准的挑战性:即使最先进的嵌入与生成模型在文章检索任务中仅达68%准确率,而标准嵌入方法在知识库优化任务中F1分数低至0.3,这表明当前模型需要构建复杂流水线与解决方案,而非依赖传统技术。


WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

Abstract

arXiv:2505.09558v1 Announce Type: cross Abstract: End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy, from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.

摘要

诸如GPT-4o-audio等端到端语音对话模型近期在语音领域获得了广泛关注。然而,语音对话模型在会话性能方面的评估却长期被忽视。这主要是因为智能聊天机器人传递了丰富的非文本信息,而基于文本的语言模型(如ChatGPT)难以有效衡量这些信息。为填补这一空白,我们提出了WavReward——一种基于音频语言模型的奖励反馈模型,能够通过语音输入评估语音对话系统的智商(IQ)与情商(EQ)。具体而言:1)基于音频语言模型,WavReward融合了深度推理过程与非线性奖励机制进行后训练。通过强化学习算法的多样本反馈,我们构建了专用于语音对话模型的评估器;2)我们发布了ChatReward-30K偏好数据集用于训练WavReward,该数据集涵盖语音对话模型的理解与生成双维度,包含文本聊天、指令聊天的九种声学特征及隐性聊天等多任务场景。WavReward在多种语音对话场景中均超越现有最优评估模型,将Qwen2.5-Omni的客观准确率从55.1%显著提升至91.5%。在主观A/B测试中,WavReward亦以83%的优势领先。全面的消融实验验证了WavReward各模块的必要性。所有数据与代码将在论文录用后公开于https://github.com/jishengpeng/WavReward。


Ethics and Persuasion in Reinforcement Learning from Human Feedback: A Procedural Rhetorical Approach

Abstract

arXiv:2505.09576v1 Announce Type: cross Abstract: Since 2022, versions of generative AI chatbots such as ChatGPT and Claude have been trained using a specialized technique called Reinforcement Learning from Human Feedback (RLHF) to fine-tune language model output using feedback from human annotators. As a result, the integration of RLHF has greatly enhanced the outputs of these large language models (LLMs) and made the interactions and responses appear more "human-like" than those of previous versions using only supervised learning. The increasing convergence of human and machine-written text has potentially severe ethical, sociotechnical, and pedagogical implications relating to transparency, trust, bias, and interpersonal relations. To highlight these implications, this paper presents a rhetorical analysis of some of the central procedures and processes currently being reshaped by RLHF-enhanced generative AI chatbots: upholding language conventions, information seeking practices, and expectations for social relationships. Rhetorical investigations of generative AI and LLMs have, to this point, focused largely on the persuasiveness of the content generated. Using Ian Bogost's concept of procedural rhetoric, this paper shifts the site of rhetorical investigation from content analysis to the underlying mechanisms of persuasion built into RLHF-enhanced LLMs. In doing so, this theoretical investigation opens a new direction for further inquiry in AI ethics that considers how procedures rerouted through AI-driven technologies might reinforce hegemonic language use, perpetuate biases, decontextualize learning, and encroach upon human relationships. It will therefore be of interest to educators, researchers, scholars, and the growing number of users of generative AI chatbots.

摘要

自2022年起,ChatGPT和Claude等生成式AI聊天机器人通过采用名为"人类反馈强化学习"(RLHF)的专项技术,利用人类标注员的反馈对语言模型输出进行微调。这种RLHF技术的整合显著提升了大型语言模型(LLMs)的输出质量,使其交互与回应相较于仅使用监督学习的早期版本更显"拟人化"。人机文本的日益趋同对透明度、信任度、偏见及人际关系等方面可能产生严重的伦理、社会技术和教育学影响。为阐明这些影响,本文对当前被RLHF增强型生成式AI重塑的核心流程进行了修辞学分析,包括:语言规范的维护、信息检索实践以及社会关系预期。迄今为止,关于生成式AI与LLMs的修辞学研究主要集中于生成内容的说服力。本文运用Ian Bogost的程序修辞理论,将修辞研究的焦点从内容分析转向RLHF增强型LLMs内置的说服机制。通过这一理论探索,本研究为AI伦理领域的后续探究开辟了新方向,重点关注经AI技术重构的程序如何可能强化语言霸权、延续偏见、使学习脱离情境,以及侵蚀人际关系。因此,本研究将对教育工作者、研究人员、学者以及日益增长的生成式AI用户群体具有重要参考价值。


Customizing a Large Language Model for VHDL Design of High-Performance Microprocessors

Abstract

arXiv:2505.09610v1 Announce Type: cross Abstract: The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through their incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design have received significant attention due to the higher popularity of the language, but little attention so far has been given to VHDL despite its continued popularity in the industry. There has also been little discussion about the unique needs of organizations that engage in high-performance processor design, and techniques to deploy AI solutions in these settings. In this paper, we describe our journey in developing a Large Language Model (LLM) specifically for the purpose of explaining VHDL code, a task that has particular importance in an organization with decades of experience and assets in high-performance processor design. We show how we developed test sets specific to our needs and used them for evaluating models as we performed extended pretraining (EPT) of a base LLM. The expert evaluation rating of the code explanations produced by the EPT model increased to 69%, compared to a base model rating of 43%. We further show how we developed an LLM-as-a-judge to gauge models similarly to expert evaluators. This led us to derive and evaluate a host of new models, including an instruction-tuned version of the EPT model with an expected expert evaluator rating of 71%. Our experiments also indicate that with the potential use of newer base models, this rating can be pushed to 85% and beyond. We conclude with a discussion on further improving the quality of hardware design LLMs using exciting new developments in the Generative AI world.

摘要

近年来,大型语言模型(LLM)在硬件设计领域的应用迅速发展,主要体现在提升芯片设计效率的工具集成中。关于LLM在芯片设计寄存器传输级(RTL)规范中的应用已有大量讨论,其中Verilog和VHDL是最常用的两种语言。由于Verilog的较高普及度,LLM在该语言设计中的应用备受关注,而VHDL尽管在工业界持续流行却鲜少获得研究重视。同时,针对高性能处理器设计机构的特殊需求及AI解决方案部署技术的研究也较为匮乏。本文阐述了我们开发专用大型语言模型以解释VHDL代码的研究历程,该任务对于拥有数十年高性能处理器设计经验与资产的组织具有特殊意义。我们展示了如何构建符合需求的测试集,并在基础LLM的扩展预训练(EPT)过程中进行模型评估。经专家评定,EPT模型生成的代码解释准确率从基础模型的43%提升至69%。我们进一步开发了"LLM即评判员"系统,其评估结果与专家评判具有一致性。基于此,我们推导并评估了包括指令调优版EPT模型在内的多个新模型,其预期专家评分达到71%。实验表明,若采用更新的基础模型,该评分可提升至85%以上。最后,我们探讨了如何利用生成式人工智能领域的新进展来进一步提升硬件设计LLM的质量。


PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment

Abstract

arXiv:2411.11681v3 Announce Type: replace Abstract: Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop PSPO-WRS, which considers the number of reasoning steps in determining reward scores and utilizes an adjusted Weibull distribution for nonlinear reward shaping. Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.

摘要

过程监督通过在对思维链推理的每一步提供反馈,提升了大型语言模型在推理任务中的表现。然而,由于缺乏有效的过程监督方法,即使是先进的大型语言模型也容易出现逻辑错误和冗余推理。我们认为,过程监督的有效性显著依赖于推理链的准确性和长度。此外,我们发现这些因素与推理过程的整体奖励分数呈现非线性关系。基于这些发现,我们提出了一种新颖的过程监督范式PSPO*,系统性地阐述了从奖励模型训练到策略优化的流程,并强调了非线性奖励在过程监督中的重要性。基于PSPO*,我们进一步开发了PSPO-WRS,该方法在确定奖励分数时考虑了推理步骤的数量,并利用调整后的威布尔分布实现非线性奖励塑造。在六个数学推理数据集上的实验结果表明,PSPO-WRS始终优于当前主流模型。
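
The abstract says PSPO-WRS lets the number of reasoning steps influence the reward through an adjusted Weibull distribution, but it does not give the formula. The sketch below is only one hypothetical way such shaping could look: the mean per-step score is multiplied by a Weibull density rescaled to peak at 1. The function names and the `shape` and `scale` defaults are illustrative assumptions, not the authors' values.

```python
import math


def weibull_pdf(x: float, shape: float, scale: float) -> float:
    """Weibull density f(x; k, lam); returns 0 for x <= 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def step_factor(n_steps: int, shape: float = 2.0, scale: float = 6.0) -> float:
    """Length factor in [0, 1] that peaks at the Weibull mode (assumed, needs shape > 1)."""
    if shape <= 1:
        raise ValueError("shape must be > 1 so the density has an interior peak")
    mode = scale * ((shape - 1) / shape) ** (1 / shape)
    return weibull_pdf(n_steps, shape, scale) / weibull_pdf(mode, shape, scale)


def shaped_reward(step_scores: list[float], shape: float = 2.0, scale: float = 6.0) -> float:
    """Hypothetical chain reward: mean step score times the nonlinear length factor."""
    if not step_scores:
        return 0.0
    accuracy = sum(step_scores) / len(step_scores)
    return accuracy * step_factor(len(step_scores), shape, scale)


if __name__ == "__main__":
    for n in (1, 4, 10, 20):
        print(n, round(shaped_reward([1.0] * n), 4))
```

With these assumed parameters, a fully correct chain of about four steps keeps almost all of its reward, while very short or very long chains are discounted, which is the kind of nonlinear dependence on length the abstract describes.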


WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models

Abstract

arXiv:2505.09595v1 Announce Type: cross Abstract: Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity. To address this, we introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews. Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView-Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free-form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses. Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS-Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance. These findings highlight the potential of multiplex-aware AI evaluation in mitigating cultural bias in LLMs, paving the way for more inclusive and ethically aligned AI systems.

摘要

大型语言模型(LLMs)当前的训练与对齐方式主要强化了以西方为中心的认识论和社会文化规范,导致文化同质化并削弱了其反映全球文明多样性的能力。现有评估框架未能充分捕捉这种偏见,因其依赖僵化、封闭式的评估方法,忽视了文化包容性的复杂性。为此,我们提出WorldView-Bench——一个通过分析LLMs容纳多元世界观能力来评估全球文化包容性(GCI)的基准。该方法基于Senturk等人提出的"多重世界观"理论,区分了强化文化同质化的"单一型"模型与整合多元视角的"多重型"模型。WorldView-Bench通过自由生成式评估(而非传统分类基准)测量"文化极化"(即对异质观点的排斥)。我们通过两种干预策略实现应用多重性:(1) 上下文实现的多重LLMs——系统提示嵌入多重性原则;(2) 多智能体系统(MAS)实现的多重LLMs——代表不同文化视角的多个LLM智能体协同生成响应。实验结果表明,MAS实现的多重LLMs将"视角分布熵值"(PDS)从基线13%显著提升至94%,同时促进积极情感倾向(67.7%)并增强文化平衡性。这些发现揭示了多重意识AI评估在缓解LLMs文化偏见方面的潜力,为构建更具包容性与伦理对齐的AI系统开辟了新路径。
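
The abstract reports PDS entropy as a percentage but does not define it. A plausible reading, taken here as an assumption, is Shannon entropy of the distribution of perspectives in a response, normalized by its maximum so that 0% means one perspective dominates completely and 100% means all perspectives are represented equally. The category names below are hypothetical, and the sketch does not reproduce the paper's 13% or 94% figures.

```python
import math


def normalized_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy of a count distribution divided by log(number of categories)."""
    total = sum(counts.values())
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    return entropy / math.log(k)


if __name__ == "__main__":
    # Hypothetical perspective tallies for two responses.
    skewed = {"western": 18, "islamic": 1, "east_asian": 1, "african": 0}
    balanced = {"western": 5, "islamic": 5, "east_asian": 5, "african": 5}
    print(f"skewed:   {normalized_entropy(skewed):.2f}")
    print(f"balanced: {normalized_entropy(balanced):.2f}")
```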


How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference

Abstract

arXiv:2505.09598v1 Announce Type: cross Abstract: As large language models (LLMs) spread across industries, understanding their environmental footprint at the inference level is no longer optional; it is essential. However, most existing studies exclude proprietary models, overlook infrastructural variability and overhead, or focus solely on training, even as inference increasingly dominates AI's environmental impact. To bridge this gap, this paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.43 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: although individual queries are efficient, their global scale drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.

摘要

随着大语言模型(LLMs)在各行业的广泛应用,从推理层面理解其环境足迹已从可选变为必要。然而现有研究大多排除专有模型、忽视基础设施差异与开销,或仅关注训练阶段,尽管推理正日益成为人工智能环境影响的主导因素。为填补这一空白,本文提出一种新型基础设施感知基准测试框架,用于量化商业数据中心部署的30个前沿LLM推理的环境足迹。该框架整合了公开API性能数据、区域特异性环境乘数以及硬件配置的统计推断,并采用交叉效率数据包络分析(DEA)评估模型在环境成本约束下的性能表现。研究发现:o3和DeepSeek-R1是能耗最高的模型,处理长提示消耗超33瓦时,达GPT-4.1 nano的70余倍;Claude-3.7 Sonnet则展现出最佳生态效率。单个GPT-4o短查询虽仅耗电0.43瓦时,但日频次达7亿次时将产生显著年化环境影响——其电力消耗相当于3.5万美国家庭用电,淡水蒸发量满足120万人年饮用水需求,碳排放需芝加哥规模的森林才能抵消。这些发现揭示了一个日益凸显的矛盾:尽管单次查询高效,但全球规模仍导致不成比例的资源消耗。本研究为LLM部署的可持续性评估提供了标准化实证方法,为未来AI开发的环境责任与可持续标准奠定基础。
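
The scaling step in the abstract can be checked with simple arithmetic. The sketch below multiplies the quoted 0.43 Wh per short GPT-4o query by 700 million queries per day. It assumes every query is a short one and a 365-day year; the household, water, and carbon equivalences depend on multipliers that the abstract does not state, so they are not derived here.

```python
WH_PER_SHORT_QUERY = 0.43        # from the abstract
QUERIES_PER_DAY = 700_000_000    # from the abstract
DAYS_PER_YEAR = 365              # assumption


def annual_energy_gwh(wh_per_query: float, queries_per_day: int, days: int = DAYS_PER_YEAR) -> float:
    """Total yearly energy in GWh for a fixed per-query cost and daily volume."""
    return wh_per_query * queries_per_day * days / 1e9


if __name__ == "__main__":
    daily_mwh = WH_PER_SHORT_QUERY * QUERIES_PER_DAY / 1e6
    yearly_gwh = annual_energy_gwh(WH_PER_SHORT_QUERY, QUERIES_PER_DAY)
    print(f"{daily_mwh:.0f} MWh/day, {yearly_gwh:.1f} GWh/year")
    # prints: 301 MWh/day, 109.9 GWh/year
```

Under these assumptions the per-query figure alone gives roughly 110 GWh per year, which illustrates how a sub-watt-hour cost becomes a grid-scale load at this volume.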


On the Partitioning of GPU Power among Multi-Instances

Abstract

arXiv:2501.17752v2 Announce Type: replace Abstract: Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks like machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA's Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple tenants. However, accurately apportioning GPU power consumption among MIG instances remains challenging due to a lack of hardware support. This paper addresses this challenge by developing software methods to estimate power usage per MIG partition. We analyze NVIDIA GPU utilization metrics and find that light-weight methods with good accuracy can be difficult to construct. We hence explore the use of ML-based power models to enable accurate, partition-level power estimation. Our findings reveal that a single generic offline power model or modeling method is not applicable across diverse workloads, especially with concurrent MIG usage, and that online models constructed using partition-level utilization metrics of workloads under execution can significantly improve accuracy. Using NVIDIA A100 GPUs, we demonstrate this approach for accurate partition-level power estimation for workloads including matrix multiplication and Large Language Model inference, contributing to transparent and fair carbon reporting.

摘要

高效能云计算数据中心的电力管理对于降低成本、提升性能及减少环境影响至关重要。作为机器学习(ML)和生成式人工智能(GenAI)等任务的核心硬件,GPU是功耗的主要来源。英伟达多实例GPU(MIG)技术通过创建资源独立分配且支持分区间资源监控的隔离分区,实现了多租户GPU共享,从而提升GPU利用率。然而由于缺乏硬件支持,如何精确分配MIG实例间的GPU功耗仍是技术难点。本研究通过开发软件级功耗估算方法解决该问题,分析英伟达GPU利用率指标后发现:构建兼具轻量性与高精度的传统方法存在困难。为此我们探索基于机器学习的功耗建模方案,以实现精准的分区级功耗估算。实验表明:单一通用离线功耗模型或建模方法无法适配多样化工作负载(尤其在并发MIG场景下),而基于运行时工作负载分区级利用率指标构建的在线模型能显著提升精度。基于英伟达A100 GPU的测试验证了该方法在矩阵乘法和大语言模型推理等场景中实现精准分区级功耗估算的能力,为透明公正的碳排放报告提供了技术支撑。


Construction and Application of Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language Model

Abstract

arXiv:2404.03080v4 Announce Type: replace-cross Abstract: Knowledge in materials science is widely dispersed across extensive scientific literature, posing significant challenges to the efficient discovery and integration of new materials. Traditional methods, often reliant on costly and time-consuming experimental approaches, further complicate rapid innovation. Addressing these challenges, the integration of artificial intelligence with materials science has opened avenues for accelerating the discovery process, though it also demands precise annotation, data extraction, and traceability of information. To tackle these issues, this article introduces the Materials Knowledge Graph (MKG), which utilizes advanced natural language processing techniques integrated with large language models to extract and systematically organize a decade's worth of high-quality research into structured triples; the resulting graph contains 162,605 nodes and 731,772 edges. MKG categorizes information into comprehensive labels such as Name, Formula, and Application, structured around a meticulously designed ontology, thus enhancing data usability and integration. By implementing network-based algorithms, MKG not only facilitates efficient link prediction but also significantly reduces reliance on traditional experimental methods. This structured approach not only streamlines materials research but also lays the groundwork for more sophisticated science knowledge graphs.

摘要

材料科学知识广泛分散于海量科学文献中,这对新材料的高效发现与整合提出了重大挑战。传统方法通常依赖成本高昂且耗时的实验手段,进一步阻碍了快速创新进程。为解决这些问题,人工智能与材料科学的融合为加速发现过程开辟了新途径,但同时也要求实现精确的标注、数据提取和信息溯源。针对这些需求,本文提出材料知识图谱(MKG),该系统集成先进自然语言处理技术与大语言模型,从十年间的高质量研究中抽取并系统化组织结构化三元组,包含162,605个节点和731,772条边。MKG通过精心设计的本体框架,将信息分类为名称、分子式、应用等综合性标签,从而提升数据可用性与整合度。通过采用基于网络的算法,MKG不仅能有效实现链接预测,还可显著降低对传统实验方法的依赖。这种结构化方法不仅优化了材料研究流程,更为构建更复杂的科学知识图谱奠定了基础。


UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Abstract

arXiv:2503.21620v4 Announce Type: replace Abstract: The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.

摘要

近期发布的DeepSeek-R1通过基于规则的强化学习(RL)展示了大型语言模型(LLM)中推理能力的涌现。尽管该方法在语言模型中取得成功,但其在多模态领域(尤其是图形用户界面(GUI)智能体任务)的应用仍待探索。针对这一问题,我们提出UI-R1框架,首次探索基于规则的RL如何增强多模态大语言模型(MLLM)在GUI动作预测任务中的推理能力。具体而言,UI-R1引入了一种新颖的基于规则的动作奖励机制,支持通过基于策略的算法(如组相对策略优化GRPO)进行模型优化。为提升训练效率,我们构建了一个包含136项挑战性任务的小型高质量数据集,涵盖移动设备上五种常见动作类型。实验结果表明,所提出的UI-R1-3B模型在领域内(ID)和领域外(OOD)任务上均显著优于基础模型(即Qwen2.5-VL-3B),在ScreenSpot、ScreenSpot-Pro和ANDROIDCONTROL数据集上平均准确率分别提升22.1%、6.0%和12.7%。此外,与通过监督微调(SFT)在76K样本上训练的更大模型(如OS-Atlas-7B)相比,UI-R1-3B展现出具有竞争力的性能。我们还开发了优化版本UI-R1-E-3B,显著提升了基础定位效率和准确率。这些结果证明了基于规则的强化学习在推进GUI理解与控制方面的潜力,为该领域的未来研究奠定了基础。代码网站:https://github.com/lll6gg/UI-R1。
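
The combination of a rule-based action reward with GRPO can be sketched as follows. The reward rule here is a hypothetical stand-in (one point for the correct action type, one more if a click lands inside the target box); the abstract does not specify UI-R1's actual rule. The advantage computation follows the usual GRPO idea of standardizing each sampled response's reward against the other samples drawn for the same prompt.

```python
import statistics


def action_reward(pred: dict, target: dict) -> float:
    """Hypothetical rule: +1 for matching action type, +1 if the click point is inside the target box."""
    reward = 0.0
    if pred.get("type") == target.get("type"):
        reward += 1.0
        point, bbox = pred.get("point"), target.get("bbox")
        if point is not None and bbox is not None:
            x, y = point
            x0, y0, x1, y1 = bbox
            if x0 <= x <= x1 and y0 <= y <= y1:
                reward += 1.0
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    target = {"type": "click", "bbox": (10, 10, 50, 50)}
    samples = [
        {"type": "click", "point": (20, 20)},   # right type, inside box
        {"type": "click", "point": (90, 90)},   # right type, outside box
        {"type": "scroll"},                     # wrong type
    ]
    rewards = [action_reward(s, target) for s in samples]
    print(rewards, [round(a, 3) for a in group_relative_advantages(rewards)])
```

Because the reward is computed by fixed rules, no learned reward model is needed, and the group normalization turns raw scores into a signal about which sampled action was better than its siblings.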


What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Abstract

arXiv:2411.03343v2 Announce Type: replace-cross Abstract: Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train probes to distinguish successful from unsuccessful jailbreaks using the latent representations corresponding to prompt tokens. Notably, we find that even when probes achieve high accuracy in predicting the success of jailbreaks, their performance often fails to generalize to unseen attack methods. This reveals that different jailbreaking strategies exploit different non-linear, non-universal features. Next, we demonstrate that non-linear probes provide a powerful tool for steering model behavior. Specifically, we use these probes to guide targeted latent space perturbations, enabling us to effectively modulate the model's robustness against jailbreaks. Overall, our findings challenge the assumption that jailbreaks can be fully understood through linear or simple universal prompt features alone, highlighting the importance of a nuanced understanding of the mechanisms behind LLM vulnerabilities.

摘要

越狱攻击一直是大型语言模型(LLM)安全性与可靠性研究的核心议题,然而这些攻击的内在机制仍未被充分理解。现有研究主要依赖线性方法检测越狱尝试和模型拒绝行为,我们则通过分析成功越狱提示中的线性与非线性特征提出了新视角。首先,我们构建了一个包含35种攻击方法、总计10,800次越狱尝试的新型数据集。利用该数据集,我们基于提示标记的潜在表征训练探针来区分成功与失败的越狱。值得注意的是,即使探针在预测越狱成功率时表现出高准确度,其性能往往无法泛化至未见过的攻击方法,这表明不同越狱策略利用了各异的非线性、非普适性特征。进一步研究表明,非线性探针可作为引导模型行为的有效工具:通过指导定向潜在空间扰动,我们成功实现了对模型抗越狱鲁棒性的精准调控。本研究从根本上挑战了"仅通过线性或简单普适性提示特征即可完全理解越狱机制"的假设,强调了深入理解LLM脆弱性背后复杂机制的重要性。


FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

Abstract

arXiv:2410.04526v3 Announce Type: replace-cross Abstract: In this paper, we introduce FAMMA, an open-source benchmark for financial multilingual multimodal question answering (QA). Our benchmark aims to evaluate the abilities of large language models (LLMs) in answering complex reasoning questions that require advanced financial knowledge. The benchmark has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales held out from the public for a contamination-free evaluation. These questions cover advanced knowledge of 8 major subfields in finance (e.g., corporate finance, derivatives, and portfolio management). Some are in Chinese or French, while a majority of them are in English. Each question has some non-text data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data, and fine-tuned a series of open-source Qwen models using this reasoning data. We found that training a model on these reasoning trajectories can significantly improve its performance on FAMMA-LivePro. We released our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.

摘要

本文介绍了FAMMA——一个开源的金融多语言多模态问答(QA)基准测试平台。该基准旨在评估大语言模型(LLMs)在需要高级金融知识的复杂推理问题上的表现。基准包含两个版本:FAMMA-Basic由1,945道从大学教材和考试中提取的问题组成,附带人工标注的答案和解析;FAMMA-LivePro则包含103道由领域专家原创的新题,其答案与解析未公开以确保无污染的评估。这些问题涵盖公司金融、衍生品、投资组合管理等8个金融子领域的高级知识,部分题目为中文或法语,多数为英文,且每道题均配有图表等非文本数据。实验表明FAMMA对包括GPT-o1和DeepSeek-R1在内的推理模型构成显著挑战。此外,我们整理了DeepSeek-R1在FAMMA-Basic上的1,270条推理轨迹,并利用这些数据微调了一系列开源的Qwen模型。研究发现基于推理轨迹训练的模型在FAMMA-LivePro上表现显著提升。我们已公开排行榜、数据、代码及训练模型,详见https://famma-bench.github.io/famma/


Activation Steering in Neural Theorem Provers

Abstract

arXiv:2502.15507v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in proofs, leading practitioners to use different sampling techniques to improve LLMs' capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLMs' responses and improve generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.

摘要

大型语言模型(LLMs)在使用如Lean等证明辅助工具验证形式定理方面展现出潜力。然而,当前最先进的语言模型在预测证明步骤时仍存在困难,导致实践者采用不同采样技术来提升模型能力。我们发现,LLM能够预测出正确的策略(tactic),但在候选策略集合中对其进行适当排序时面临挑战,从而影响整体选择过程。为克服这一障碍,我们采用激活导向技术来引导LLM在推理时的生成过程。研究结果表明,激活导向有望成为专用微调之外的一种轻量级替代方案,可用于增强LLM的定理证明能力,在资源受限的环境中尤其有价值。
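
The abstract does not say how the steering vector is obtained or where it is applied. As an illustration only, the sketch below uses a common construction that is swapped in here as an assumption: the steering direction is the difference between the mean hidden activation on "desired" examples and on "undesired" ones, and it is added to a hidden state at inference time with a strength `alpha`. The toy two-dimensional vectors stand in for real model activations.

```python
Vector = list[float]


def mean_vector(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: list[Vector], negative: list[Vector]) -> Vector:
    """Difference of mean activations (assumed construction, not taken from the paper)."""
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; weights stay untouched."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy activations: e.g. states where the correct tactic was ranked first vs. not.
    good = [[1.0, 2.0], [3.0, 4.0]]
    bad = [[0.0, 0.0], [2.0, 2.0]]
    direction = steering_vector(good, bad)
    print(direction, steer([0.0, 0.0], direction, alpha=0.5))
```

Because only activations are shifted and no parameters are updated, this kind of intervention is cheap compared with fine-tuning, which matches the lightweight framing in the abstract.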


Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Abstract

arXiv:2502.09650v2 Announce Type: replace-cross Abstract: The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO.

摘要

大型语言模型(LLM)的对齐研究通常假设使用更多清洁数据会产生更好结果,却忽视了模型能力与样本难度之间的匹配关系。对此我们提出新原则:偏好数据存在难度差异,过度困难的样本会因超出模型能力而阻碍对齐效果。通过系统实验,我们验证该原则并获得三项关键发现:(1)偏好样本具有难度差异,表现为不同对齐过程中稳定的学习顺序;(2)过度困难样本会显著降低四种LLM在两个数据集上的性能表现;(3)模型能力决定了其处理困难样本的阈值,揭示了数据选择与模型能力间的关键关联。基于此原则,我们提出选择性DPO方法,通过过滤过度困难样本实现优化。这一简单调整使AlpacaEval 2基准测试的胜率较DPO基线提升9-16%,优于采用不同算法调整的一系列DPO变体。这些成果共同阐明了数据难度与模型能力相匹配的重要性,为改进LLM对齐策略提供了变革性视角。代码已公开于https://github.com/glorgao/SelectiveDPO。
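
The filtering step behind Selective DPO can be illustrated with a small sketch. It assumes each preference example already carries a numeric difficulty score; how that score is computed is not stated in the abstract, so the `difficulty` field, the `keep_fraction` parameter, and the function name are hypothetical. The retained subset would then be passed to an ordinary DPO training run.

```python
from typing import Callable


def select_easiest(
    examples: list[dict],
    difficulty: Callable[[dict], float],
    keep_fraction: float = 0.5,
) -> list[dict]:
    """Keep the easiest share of preference examples, dropping the hardest ones."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not examples:
        return []
    ranked = sorted(examples, key=difficulty)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical preference pairs with precomputed difficulty scores.
    data = [
        {"prompt": "p1", "difficulty": 0.9},
        {"prompt": "p2", "difficulty": 0.1},
        {"prompt": "p3", "difficulty": 0.4},
        {"prompt": "p4", "difficulty": 0.7},
    ]
    kept = select_easiest(data, lambda ex: ex["difficulty"], keep_fraction=0.5)
    print([ex["prompt"] for ex in kept])
```

In line with the abstract's third finding, a larger model would presumably tolerate a higher `keep_fraction`, since its threshold for handling difficult examples is higher.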


ThreatModeling-LLM: Automating Threat Modeling using Large Language Models for Banking System

Abstract

arXiv:2411.17058v2 Announce Type: replace-cross Abstract: Threat modeling is a crucial component of cybersecurity, particularly for industries such as banking, where the security of financial data is paramount. Traditional threat modeling approaches require expert intervention and manual effort, often leading to inefficiencies and human error. The advent of Large Language Models (LLMs) offers a promising avenue for automating these processes, enhancing both efficiency and efficacy. However, this transition is not straightforward due to three main challenges: (1) the lack of publicly available, domain-specific datasets, (2) the need for tailored models to handle complex banking system architectures, and (3) the requirement for real-time, adaptive mitigation strategies that align with compliance standards like NIST 800-53. In this paper, we introduce ThreatModeling-LLM, a novel and adaptable framework that automates threat modeling for banking systems using LLMs. ThreatModeling-LLM operates in three stages: 1) dataset creation, 2) prompt engineering and 3) model fine-tuning. We first generate a benchmark dataset using Microsoft Threat Modeling Tool (TMT). Then, we apply Chain of Thought (CoT) and Optimization by PROmpting (OPRO) on the pre-trained LLMs to optimize the initial prompt. Lastly, we fine-tune the LLM using Low-Rank Adaptation (LoRA) based on the benchmark dataset and the optimized prompt to improve the threat identification and mitigation generation capabilities of pre-trained LLMs.

摘要

威胁建模是网络安全的关键组成部分,对于金融数据安全至关重要的银行业等行业尤为如此。传统威胁建模方法需要专家介入和人工操作,往往导致效率低下和人为错误。大型语言模型(LLMs)的出现为自动化这些流程提供了可行途径,既能提升效率又可增强效果。然而这一转型面临三大挑战:(1)缺乏公开可用的领域专用数据集;(2)需要定制化模型来处理复杂的银行系统架构;(3)必须满足实时自适应的缓解策略要求,以符合NIST 800-53等合规标准。本文提出ThreatModeling-LLM——一个基于LLMs的银行系统自动化威胁建模新型适配框架。该框架分三阶段运作:1)数据集创建,2)提示工程,3)模型微调。我们首先使用微软威胁建模工具(TMT)生成基准数据集,随后在预训练LLMs上应用思维链(CoT)和提示优化(OPRO)来优化初始提示,最后基于基准数据集和优化后的提示,采用低秩自适应(LoRA)对LLM进行微调,从而提升预训练模型在威胁识别和缓解方案生成方面的能力。


FAS: Fast ANN-SNN Conversion for Spiking Large Language Models

Abstract

arXiv:2502.04405v2 Announce Type: replace-cross Abstract: Spiking Large Language Models have been shown to be a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs a full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance with significantly reduced inference latency and computational costs. Notably, FAS takes only eight timesteps to achieve an accuracy 3% higher than that of the OPT-7B model, while reducing energy consumption by 96.63%. The source code is available at https://github.com/lc783/FAS

摘要

脉冲大语言模型已被证明是传统大语言模型在多种场景下的有效替代方案。现有创建脉冲大语言模型的方法(即直接训练和人工神经网络-脉冲神经网络转换)常面临性能下降和计算成本较高的问题。为解决这些问题,我们提出了一种新颖的快速人工神经网络-脉冲神经网络转换策略(FAS),该策略通过两阶段将大语言模型转换为脉冲大语言模型。第一阶段采用预训练模型的全参数微调,因此无需从头开始直接训练;第二阶段引入粗粒度到细粒度的校准方法以减少转换误差并提升精度。在四种不同规模大语言模型上的语言及视觉-语言任务实验表明,FAS能以显著降低的推理延迟和计算成本实现最先进性能。值得注意的是,FAS仅需八个时间步即可获得比OPT-7B模型高3%的准确率,同时能耗降低96.63%。源代码详见https://github.com/lc783/FAS


InductionBench: LLMs Fail in the Simplest Complexity Class

Abstract

arXiv:2502.15823v4 Announce Type: replace-cross Abstract: Large language models (LLMs) have shown remarkable improvements in reasoning, and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.

摘要

大型语言模型(LLMs)在推理能力上展现出显著进步,诸如o1和o3等模型已完全或部分解决了现有众多基准测试。然而,这些基准大多侧重于演绎推理——包括数学与编程任务,其中数学公理或编程语法等规则被明确定义,LLMs可据此规划并应用这些规则以获得解决方案。相比之下,归纳推理(即从观测数据中推断潜在规律)的研究仍较匮乏。此类归纳过程是科学发现的核心,因其使研究者能从经验观察中提炼普适原理。为评估LLMs是否具备该能力,我们提出InductionBench——一个专门评估LLMs归纳推理能力的新基准。实验结果表明,即使当前最先进的模型也难以掌握函数次正则层级中最简单的复杂度类别,这凸显出现有LLMs在归纳推理能力上的显著缺陷。代码与数据详见https://github.com/Wenyueh/inductive_reasoning_benchmark。


Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Abstract

arXiv:2503.17599v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not yet ready for deployment in such settings without human oversight, and further optimization specifically tailored to the daily responsibilities of GPs is essential.

摘要

大语言模型(LLMs)在全科医疗中展现出显著潜力。然而,现有基准测试与评估框架主要依赖考试式或简化的问答形式,缺乏与真实世界全科医生临床职责相匹配的能力导向结构。因此,LLMs能否可靠履行全科医生(GPs)职责仍不明确。本研究提出新型评估框架以检验LLMs作为全科医生的能力。基于该框架,我们构建了全科医学基准(GPBench),其数据严格遵循常规临床实践标准并由领域专家标注。我们评估了十种前沿LLMs并分析其核心能力。研究结果表明,当前LLMs在无人工监督的情况下尚不适合直接部署于此类场景,且亟需针对全科医生日常职责进行专项优化。


Llama-Nemotron: Efficient Reasoning Models

Abstract

arXiv:2505.00949v3 Announce Type: replace-cross Abstract: We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.


Don't be lazy: CompleteP enables compute-efficient deep transformers

Abstract

arXiv:2505.01618v2 Announce Type: replace-cross Abstract: We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.

摘要

我们研究了在使用不同参数化方法(即随模型规模变化调整模型和优化器超参数的规则)时大语言模型训练的计算效率。某些参数化方法无法在模型深度变化时传递最优基础超参数(如学习率),迫使实践者要么在扩大规模时重新调整这些超参数(成本高昂),要么在无法重新调整时接受次优训练。即使实现了超参数传递,我们通过理论分析发现参数化方法仍可能处于惰性学习状态——各层仅学习接近其线性化的特征,从而无法有效利用深度和非线性。最终,我们确定并采用名为CompleteP的参数化方法,该方法在所有层中既能实现深度方向的超参数传递,又能避免惰性学习。CompleteP使更广范围的模型宽度/深度比例保持计算效率,从而解锁更适合不同硬件设置和操作场景的模型架构。此外,与现有最优方法相比,CompleteP实现了12-34%的计算效率提升。


Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges

Abstract

arXiv:2504.16472v2 Announce Type: replace-cross Abstract: Despite decades of research and practice in automated software testing, several fundamental concepts remain ill-defined and under-explored, yet offer enormous potential real-world impact. We show that these concepts raise exciting new challenges in the context of Large Language Models for software test generation. More specifically, we formally define and investigate the properties of hardening and catching tests. A hardening test is one that seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. Hardening tests can be generated at any time and may become catching tests when a future regression is caught. We also define and motivate the Catching 'Just-in-Time' (JiTTest) Challenge, in which tests are generated 'just-in-time' to catch new faults before they land into production. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code. We enumerate possible outcomes for hardening and catching tests and JiTTests, and discuss open research problems, deployment options, and initial results from our work on automated LLM-based hardening at Meta. This paper was written to accompany the keynote by the authors at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Author order is alphabetical. The corresponding author is Mark Harman.

摘要

尽管自动化软件测试领域已历经数十年的研究与实践,若干基础性概念仍缺乏明确定义且未被充分探索,而这些概念蕴含着巨大的现实应用潜力。本文揭示了这些概念在基于大语言模型的软件测试生成背景下所引发的一系列激动人心的新挑战。具体而言,我们通过形式化方法定义并研究了硬化测试与捕获测试的特性:硬化测试旨在防范未来的回归缺陷,而捕获测试则用于捕捉由代码变更引发的回归缺陷或新功能缺陷。硬化测试可随时生成,并在未来捕获回归缺陷时转化为捕获测试。我们还提出并论证了'即时捕获测试'(JiTTest)挑战,其核心在于在故障进入生产环境前'即时'生成测试用例加以拦截。研究表明,任何针对即时捕获测试生成的解决方案均可被复用于捕捉遗留代码中的潜在缺陷。我们系统阐述了硬化测试、捕获测试及即时测试的可能结果,探讨了待解的研究问题、部署方案,并展示了在Meta公司基于大语言模型的自动化硬化测试初步成果。本文系作者在2025年ACM国际软件工程基础研讨会(FSE)主题报告的配套论文。作者按姓氏字母排序,通讯作者为Mark Harman。


Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

Abstract

arXiv:2503.11197v4 Announce Type: replace-cross Abstract: Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that the RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.

摘要

近期,强化学习(RL)被证明能显著增强大语言模型(LLMs)的推理能力,基于RL的方法已逐步应用于视觉多模态任务。然而,音频模态在这些进展中很大程度上被忽视。为此,我们在音频理解与推理领域开展了一系列RL探索,特别聚焦于音频问答(AQA)任务。我们采用组相对策略优化(GRPO)算法对Qwen2-Audio-7B-Instruct进行优化,实验表明该方法在MMAU Test-mini基准测试中实现了64.5%的准确率,达到最先进性能。本技术报告的主要发现如下:1)GRPO算法可有效应用于大型音频语言模型(LALMs),即使模型仅有82亿参数;2)仅使用38k训练后样本,RL即显著超越监督微调(SFT),表明基于RL的方法无需大数据集即可生效;3)显式推理过程对AQA任务未显现显著益处,如何高效利用深度思考仍是待研究的开放问题;4)LALMs在听觉-语言推理方面仍远落后于人类,暗示基于RL的方法需进一步探索。项目详见https://github.com/xiaomi-research/r1-aqa与https://huggingface.co/mispeech/r1-aqa。
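
For readers unfamiliar with GRPO, its defining step is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step, as commonly described for GRPO; it is not taken from the r1-aqa code. The reward values, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one audio question
# (1.0 = correct choice, 0.0 = incorrect choice).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly [1.0, -1.0, -1.0, 1.0]
```

When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal, which is one reason reward design and sampling diversity matter in this setup.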


Simulating and Analysing Human Survey Responses with Large Language Models: A Case Study in Energy Stated Preference

Abstract

arXiv:2503.10652v2 Announce Type: replace-cross Abstract: Survey research plays a crucial role in studies by capturing consumer preferences and informing policy decisions. Stated preference (SP) surveys help researchers understand how individuals make trade-offs in hypothetical, potentially futuristic, scenarios. However, traditional methods are costly, time-consuming, and affected by respondent fatigue and ethical constraints. Large language models (LLMs) have shown remarkable capabilities in generating human-like responses, prompting interest in their use in survey research. This study investigates LLMs for simulating consumer choices in energy-related SP surveys and explores their integration into data collection and analysis workflows. Test scenarios were designed to assess the simulation performance of several LLMs (LLaMA 3.1, Mistral, GPT-3.5, DeepSeek-R1) at individual and aggregated levels, considering prompt design, in-context learning (ICL), chain-of-thought (CoT) reasoning, model types, integration with traditional choice models, and potential biases. While LLMs achieve accuracy above random guessing, performance remains insufficient for practical simulation use. Cloud-based LLMs do not consistently outperform smaller local models. DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment. Previous SP choices are the most effective input; longer prompts with more factors reduce accuracy. Mixed logit models can support LLM prompt refinement. Reasoning LLMs show potential in data analysis by indicating factor significance, offering a qualitative complement to statistical models. Despite limitations, pre-trained LLMs offer scalability and require minimal historical data. Future work should refine prompts, further explore CoT reasoning, and investigate fine-tuning techniques.

摘要

调查研究通过捕捉消费者偏好和指导政策决策在研究中发挥着关键作用。陈述偏好(SP)调查帮助研究者理解个体如何在假设性、可能具有未来特征的场景中进行权衡。然而传统方法成本高昂、耗时且受受访者疲劳和伦理约束影响。大型语言模型(LLMs)在生成类人响应方面展现出卓越能力,引发了其在调查研究中应用的兴趣。本研究探讨LLMs在能源相关SP调查中模拟消费者选择的可行性,并探索其与数据收集分析工作流的整合。测试场景设计用于评估多个LLMs(LLaMA 3.1、Mistral、GPT-3.5、DeepSeek-R1)在个体和聚合层面的模拟性能,考虑提示设计、上下文学习(ICL)、思维链(CoT)推理、模型类型、与传统选择模型的整合及潜在偏差。虽然LLMs准确率高于随机猜测,但其性能仍不足以满足实际模拟需求。云端LLMs并未持续优于小型本地模型。DeepSeek-R1以77%的平均准确率表现最佳,在准确性、因素识别和选择分布对齐方面优于非推理型LLMs。既往SP选择是最有效的输入;包含更多因素的较长提示会降低准确率。混合logit模型可支持LLM提示优化。推理型LLMs通过指示因素显著性在数据分析中展现出潜力,为统计模型提供定性补充。尽管存在局限,预训练LLMs具有可扩展性且只需极少历史数据。未来工作应优化提示设计、深入探索CoT推理并研究微调技术。